Python Programming/Internet Data

This lesson introduces processing Internet-based data with Python, including web pages and email (HTML, XML, JSON, and SMTP).

Objectives and Skills

Objectives and skills for this lesson include:[1]

  • Standard Library
    • urllib and json modules

Readings

  1. Wikipedia: HTML
  2. Wikipedia: XML
  3. Wikipedia: JSON
  4. Wikipedia: Simple Mail Transfer Protocol
  5. Python for Everyone: Networked programs
  6. Python for Everyone: Using Web Services

Multimedia

  1. YouTube: Python for Informatics - Chapter 12 - Networked Programs
  2. YouTube: Python for Informatics Chapter 13 - Web Services (Part 1/3)
  3. YouTube: Python for Informatics Chapter 13 - Web Services (Part 2/3)
  4. YouTube: Python for Informatics Chapter 13 - Web Services (Part 3/3)
  5. YouTube: Python - Downloading Files from the Web

Examples

The urllib.request.urlopen() Method

The urllib.request.urlopen() method opens the given URL. For HTTP and HTTPS URLs, it returns an http.client.HTTPResponse object that may be read like a file object.[2]

import urllib.request

url = "https://en.wikiversity.org/wiki/Python_Programming/Internet_Data"
try:
    page = urllib.request.urlopen(url).read().decode()
except Exception as exception:
    print(str(exception) + " reading " + url)
    exit(1)

for line in page.split("\n"):
    print(line)

Output:

<This page's source HTML...>
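
The example above catches every exception and leaves the response open. As a minimal sketch of a slightly more defensive variant, the helper below (fetch_text is a hypothetical name, not part of urllib) closes the response with a with statement, uses the character set declared by the server when there is one, and catches only urllib.error.URLError. A data: URL stands in for a real web address so that the sketch runs without a network connection.

```python
import urllib.error
import urllib.request


def fetch_text(url, default_encoding="utf-8"):
    """Return the body at url as text, or None if it cannot be read."""
    try:
        # The with statement closes the response even if read() fails.
        with urllib.request.urlopen(url, timeout=10) as response:
            # Prefer the charset declared in the Content-Type header, if any.
            encoding = response.headers.get_content_charset() or default_encoding
            return response.read().decode(encoding)
    except urllib.error.URLError as error:
        # URLError also covers HTTPError (for example, 404 Not Found).
        print("%s reading %s" % (error, url))
        return None


# A data: URL stands in for a web address so the sketch runs offline.
text = fetch_text("data:text/plain;charset=utf-8,Hello%20from%20urllib")
print(text)
```

Passing the lesson's own URL to fetch_text() instead should return the same HTML as the example above, assuming the site is reachable.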

The xml.etree.ElementTree.fromstring() Method

The xml.etree.ElementTree.fromstring() method parses XML from a string directly into an XML Element, which is the root element of the parsed tree.[3]

import xml.etree.ElementTree

root = xml.etree.ElementTree.fromstring(page)
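
The snippet above depends on a page variable that holds well-formed XML. As a self-contained sketch, the following parses a small inline document (the sample content is made up for illustration) and shows that text which is not well-formed XML, as is common for HTML pages, raises a ParseError.

```python
import xml.etree.ElementTree

# Hypothetical sample document; any well-formed XML string would work.
sample = """<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
</note>"""

root = xml.etree.ElementTree.fromstring(sample)
print(root.tag)              # note
print(root.find("to").text)  # Tove

# HTML often omits closing tags, so it may not parse as XML.
try:
    xml.etree.ElementTree.fromstring("<p>unclosed<br></p>")
    parsed_html = True
except xml.etree.ElementTree.ParseError as error:
    parsed_html = False
    print("Not well-formed XML:", error)
```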

The xml.etree.ElementTree.ElementTree() Method

The xml.etree.ElementTree.ElementTree() method returns an ElementTree hierarchy for the given element.[4]

import xml.etree.ElementTree

tree = xml.etree.ElementTree.ElementTree(root)
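
One reason to wrap a root element in an ElementTree is to gain document-level operations such as write(). The sketch below, which uses a made-up one-element document, shows that the tree keeps a reference to the same root and can serialize it back to text.

```python
import io
import xml.etree.ElementTree

# Hypothetical sample document used only for illustration.
root = xml.etree.ElementTree.fromstring("<note><to>Tove</to></note>")
tree = xml.etree.ElementTree.ElementTree(root)

# The tree wraps the element that was passed in; it does not copy it.
print(tree.getroot() is root)  # True

# encoding="unicode" writes text (str) rather than bytes.
buffer = io.StringIO()
tree.write(buffer, encoding="unicode")
print(buffer.getvalue())       # <note><to>Tove</to></note>
```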

The xml.etree.ElementTree.iter() Method

The xml.etree.ElementTree.iter() method returns an iterator that loops over all elements in the tree in document (depth-first) order, starting with the root element.[5]

import urllib.request
import xml.etree.ElementTree

url = "http://www.w3schools.com/xml/note.xml"
try:
    page = urllib.request.urlopen(url).read()
    page = page.decode("UTF-8")
except Exception as exception:
    print(str(exception) + " reading " + url)
    exit(1)

root = xml.etree.ElementTree.fromstring(page)
tree = xml.etree.ElementTree.ElementTree(root)

for element in tree.iter():
    print("%s: %s" % (element.tag, element.text))

Output:

<The XML elements in the page http://www.w3schools.com/xml/note.xml...>
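
Because the example above needs a network connection, the traversal order can be hard to see. The following offline sketch uses a made-up nested document to show that iter() visits a parent before its children, and that passing a tag name limits the results to matching elements.

```python
import xml.etree.ElementTree

# Hypothetical sample with one nested element (line inside body).
sample = "<note><to>Tove</to><body><line>Hello</line></body><from>Jani</from></note>"
root = xml.etree.ElementTree.fromstring(sample)
tree = xml.etree.ElementTree.ElementTree(root)

# Every element, parents before children, in document order.
tags = [element.tag for element in tree.iter()]
print(tags)   # ['note', 'to', 'body', 'line', 'from']

# Only elements with the given tag, at any depth.
lines = [element.text for element in tree.iter("line")]
print(lines)  # ['Hello']
```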

The xml.etree.ElementTree.findall() Method

The xml.etree.ElementTree.findall() method finds only elements with a specific tag which are direct children of the current element.[6]

import urllib.request
import xml.etree.ElementTree

url = "http://www.w3schools.com/xml/note.xml"
try:
    page = urllib.request.urlopen(url).read()
    page = page.decode("UTF-8")
except Exception as exception:
    print(str(exception) + " reading " + url)
    exit(1)

root = xml.etree.ElementTree.fromstring(page)
tree = xml.etree.ElementTree.ElementTree(root)

for element in tree.findall("to"):
    print("%s: %s" % (element.tag, element.text))

Output:

<The XML "to" element(s) in the page http://www.w3schools.com/xml/note.xml...>
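
The restriction to direct children is easy to overlook. As an illustration on a made-up document, the sketch below compares a plain tag name with a path and with the ".//" prefix, which searches all descendants.

```python
import xml.etree.ElementTree

# Hypothetical sample: name appears both nested and as a direct child.
sample = """<menu>
    <food><name>Waffles</name></food>
    <food><name>Toast</name></food>
    <name>Breakfast</name>
</menu>"""
root = xml.etree.ElementTree.fromstring(sample)

# A plain tag matches direct children of the current element only.
direct = [element.text for element in root.findall("name")]
print(direct)    # ['Breakfast']

# A path walks down one level at a time.
nested = [element.text for element in root.findall("food/name")]
print(nested)    # ['Waffles', 'Toast']

# The .// prefix matches descendants at any depth.
anywhere = [element.text for element in root.findall(".//name")]
print(anywhere)  # ['Waffles', 'Toast', 'Breakfast']
```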

The json.loads() Method

The json.loads() method converts a given JSON string to a corresponding Python object (dict, list, string, etc.).[7] The Wikimedia Pageview API is documented at https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI.

import urllib.request
import json

# The article title is URL-encoded: %2f stands for "/".
url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikiversity/all-access/user/" + \
    "Python_Programming%2fInternet_Data" + \
    "/daily/2016100100/2016103100"

try:
    page = urllib.request.urlopen(url).read().decode()
except Exception as exception:
    print(str(exception) + " reading " + url)
    exit(1)

print("Page Views")
dictionary = json.loads(page)
for item in dictionary["items"]:
    print("%s: %s" % (item["timestamp"], item["views"]))

Output:

<Page views for this page for 2016 October ...>
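
To experiment with json.loads() without calling the API, a JSON string can be supplied directly. The sketch below uses invented values arranged like the "items" list read in the example above; it is not real pageview data. It also shows that text which is not valid JSON raises json.JSONDecodeError.

```python
import json

# Hypothetical response text shaped like the "items" list used above.
page = """{"items": [
    {"timestamp": "2016100100", "views": 12},
    {"timestamp": "2016100200", "views": 7}
]}"""

dictionary = json.loads(page)

# JSON objects become dicts, arrays become lists, numbers become int or float.
total = sum(item["views"] for item in dictionary["items"])
print("Total views: %d" % total)  # Total views: 19

# JSON requires double-quoted strings, so this text is rejected.
try:
    json.loads("{'items': []}")
    valid = True
except json.JSONDecodeError as error:
    valid = False
    print("Invalid JSON:", error)
```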

The smtplib Module

The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP or ESMTP listener daemon.[8]

import smtplib
 
server = "smtp.gmail.com"
port = 587
username = "username"
password = "password"

sender = "me@domain"
recipient = "you@domain"
subject = "Python Email Test"
message = "Hello from Python!"

try:
    smtp = smtplib.SMTP(server, port)
    smtp.starttls()
    smtp.login(username, password)
    # A blank line must separate the headers from the message body.
    smtp.sendmail(sender, recipient,
        "From: %s\nTo: %s\nSubject: %s\n\n%s" % (sender, recipient, subject, message))
    smtp.quit()

    print("Sent message.")
except Exception as exception:
    print(exception)

Output:

Sent message.
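
Formatting the headers by hand is easy to get wrong. As an alternative sketch, the standard library's email.message.EmailMessage class can build the message and insert the separator between headers and body itself. The addresses below are placeholders, and nothing is sent.

```python
from email.message import EmailMessage

# Placeholder addresses; replace them with real ones before sending.
message = EmailMessage()
message["From"] = "me@example.com"
message["To"] = "you@example.com"
message["Subject"] = "Python Email Test"
message.set_content("Hello from Python!")

# Show the headers, the blank separator line, and the body.
print(message.as_string())

# With a connected, logged-in smtplib.SMTP object named smtp,
# smtp.send_message(message) would send this message.
```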

Activities

Tutorials

  1. Complete one or more of the following tutorials:

Practice

  1. Create a Python program that asks the user for a URL that contains HTML tags, such as:
        <p><strong>This is a bold paragraph.</strong></p>
    Check for a URL parameter passed from the command line. If there is no parameter, ask the user to input a URL for processing. Verify that the URL exists and then use RegEx methods to search for and remove all HTML tags from the text, saving each removed tag in a dictionary. Print the untagged text and then use a function to display the list of removed tags sorted in alphabetical order and a histogram showing how many times each tag was used. Include error handling in case an HTML tag isn't entered correctly (an unmatched < or >). Use a user-defined function for the actual string processing, separate from input and output. For example:
        </p>: *
        </strong>: *
        <p>: *
        <strong>: *
  2. Create a Python program that reads XML data from http://www.w3schools.com/xml/simple.xml and builds a list of menu items, with each list entry containing a dictionary with fields for the item's name, price, description, and calories. After parsing the XML data, display the menu items in decreasing order by price similar to:
        name - description - calories - price
  3. Create a Python program that asks the user for a location (city and state, province, or country). Use Google's Geocoding API to look up and display the given location's latitude and longitude. Also display a URL that could be used to pinpoint the given location on a map. The output should be similar to:
        Location: <location>
        Latitude: <latitude>
        Longitude: <longitude>
        Map: https://www.google.com/maps/@<latitude>,<longitude>,15z
  4. Create a Python program that asks the user for a Wikiversity page title and the user's email address. Check for URL and email address parameters passed from the command line. If there are no parameters, ask the user to input a page title and email address for processing. Verify that the Wikiversity page exists, and then check the page to see when it was last modified. If it was modified within the last 24 hours, send the user an email message letting them know that the page was modified recently. Include a link to the Wikiversity page in the email message.

Lesson Summary

Internet Data Concepts

  • HyperText Markup Language (HTML) is the standard markup language for creating web pages and web applications.[9]
  • HTML describes the structure of a web page semantically and originally included cues for the appearance (layout) of the document.[10]
  • HTML elements are the building blocks of HTML pages.[11]
  • HTML elements are delineated by tags, written using angle brackets.[12]
  • Tags are typically written using the syntax <tag>content</tag>.[13]
  • Some tags are written using the syntax <tag content />.[14]
  • Browsers do not display the HTML tags, but use them to interpret the content of the page.[15]
  • Cascading Style Sheets (CSS) define the look and layout of content.[16]
  • The HTML start tag may also include attributes within the tag, using the syntax <tag attribute="value" ... >content</tag>.[17]
  • The style attribute may be used to embed CSS style inside HTML tags using the syntax <tag style="property:value; ...">content</tag>.[18]
  • HTML comments are written using the syntax <!-- comment -->.[19]
  • Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.[20]
  • Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.[21]
  • Well-formed XML follows a syntax similar to HTML, using nested tags to represent data structure and values.[22]
  • JSON (JavaScript Object Notation) is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.[23]
  • JSON is the most common data format used for asynchronous browser/server communication, largely replacing XML.[24]
  • Simple Mail Transfer Protocol (SMTP) is an Internet standard for electronic mail (email) transmission.[25]
  • Although electronic mail servers and other mail transfer agents use SMTP to send and receive mail messages, user-level client mail applications typically use SMTP only for sending messages to a mail server for relaying. For retrieving messages, client applications usually use either IMAP or POP3.[26]
  • SMTP communication between mail servers by default uses the TCP port 25. Mail clients often use port 587 to submit emails to the mail service. Despite being deprecated, the nonstandard port 465 is commonly used by mail providers.[27]
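  • SMTP connections can be secured with TLS (formerly SSL), either by upgrading a plain connection with the STARTTLS command or by connecting with TLS from the start, a method known as SMTPS.[28]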
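
The HTML tag, attribute, and comment syntax summarized in the list above can be explored with the standard library's html.parser module. The sketch below is one possible illustration: TagLister is a hypothetical class name, and the HTML string is a made-up fragment.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect start tags, attributes, comments, and text from HTML."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.comments = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the start tag.
        self.start_tags.append((tag, dict(attrs)))

    def handle_comment(self, data):
        # data is the text between <!-- and -->.
        self.comments.append(data.strip())

    def handle_data(self, data):
        # data is the content between tags, which a browser would display.
        self.text.append(data)


parser = TagLister()
parser.feed('<p style="color:red"><strong>Bold</strong> text</p><!-- note -->')
parser.close()

print(parser.start_tags)     # [('p', {'style': 'color:red'}), ('strong', {})]
print(parser.comments)       # ['note']
print("".join(parser.text))  # Bold text
```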

Python Internet Data

  • The urllib.request.urlopen() method opens the given URL. For HTTP and HTTPS URLs, it returns an http.client.HTTPResponse object that may be read like a file object.[29]
  • The xml.etree.ElementTree.fromstring() method parses XML from a string directly into an XML Element, which is the root element of the parsed tree.[30]
  • The xml.etree.ElementTree.iter() method returns an iterator that loops over all elements in the tree in document (depth-first) order, starting with the root element.[31]
  • The json.loads() method converts a given JSON string to a corresponding Python object (dict, list, string, etc.).[32]
  • The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP or ESMTP listener daemon.[33]

Key Terms

API
Application Program Interface - A contract between applications that defines the patterns of interaction between two application components.[34]
BeautifulSoup
A Python library for parsing HTML documents and extracting data from HTML documents that compensates for most of the imperfections in the HTML that browsers generally ignore. You can download the BeautifulSoup code from www.crummy.com.[35]
ElementTree
A built-in Python library used to parse XML data.[36]
JSON
JavaScript Object Notation. A format that allows for the markup of structured data based on the syntax of JavaScript Objects.[37]
port
A number that generally indicates which application you are contacting when you make a socket connection to a server. As an example, web traffic usually uses port 80 while email traffic uses port 25.[38]
scrape
When a program pretends to be a web browser and retrieves a web page, then looks at the web page content. Often programs are following the links in one page to find the next page so they can traverse a network of pages or a social network.[39]
SOA
Service-Oriented Architecture. When an application is made of components connected across a network.[40]
socket
A network connection between two applications where the applications can send and receive data in either direction.[41]
spider
The act of a web search engine retrieving a page and then all the pages linked from a page and so on until they have nearly all of the pages on the Internet which they use to build their search index.[42]
XML
eXtensible Markup Language. A format that allows for the markup of structured data.[43]
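
The socket entry above describes a connection that carries data in either direction. As a small sketch that needs no network, socket.socketpair() creates two already-connected sockets in one program, and each end both sends and receives. The variable names and the ping/pong messages are invented for the example.

```python
import socket

# Two connected endpoints standing in for a client and a server.
left, right = socket.socketpair()
try:
    left.sendall(b"ping")       # one direction
    request = right.recv(1024)
    right.sendall(b"pong")      # and back again
    reply = left.recv(1024)
finally:
    left.close()
    right.close()

print(request, reply)
```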

Review Questions

  1. HyperText Markup Language (HTML) is _____.
    HyperText Markup Language (HTML) is the standard markup language for creating web pages and web applications.
  2. HTML describes _____.
    HTML describes the structure of a web page semantically and originally included cues for the appearance (layout) of the document.
  3. HTML elements are _____.
    HTML elements are the building blocks of HTML pages.
  4. HTML elements are delineated by _____.
    HTML elements are delineated by tags, written using angle brackets.
  5. Tags are typically written using the syntax _____. 
    Tags are typically written using the syntax <tag>content</tag>. 
  6. Some tags are written using the syntax _____.
    Some tags are written using the syntax <tag content />.
  7. Browsers do not display HTML tags, but use them _____.
    Browsers do not display HTML tags, but use them to interpret the content of the page.
  8. Cascading Style Sheets (CSS) define _____.
    Cascading Style Sheets (CSS) define the look and layout of content.
  9. The HTML start tag may also include _____.
    The HTML start tag may also include attributes within the tag, using the syntax <tag attribute="value" ... >content</tag>.
  10. The style attribute may be used to _____.
    The style attribute may be used to embed CSS style inside HTML tags using the syntax <tag style="property:value; ...">content</tag>.
  11. HTML comments are written using the syntax _____.
    HTML comments are written using the syntax <!-- comment -->.
  12. Extensible Markup Language (XML) is _____.
    Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
  13. Although the design of XML focuses on documents, the language is widely used for _____.
    Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.
  14. Well-formed XML follows a syntax similar to HTML, using _____.
    Well-formed XML follows a syntax similar to HTML, using nested tags to represent data structure and values.
  15. JSON (JavaScript Object Notation) is _____.
    JSON (JavaScript Object Notation) is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
  16. JSON is the most common data format used for _____.
    JSON is the most common data format used for asynchronous browser/server communication, largely replacing XML.
  17. Simple Mail Transfer Protocol (SMTP) is _____.
    Simple Mail Transfer Protocol (SMTP) is an Internet standard for electronic mail (email) transmission.
  18. Although electronic mail servers and other mail transfer agents use SMTP to send and receive mail messages, user-level client mail applications typically use _____.
    Although electronic mail servers and other mail transfer agents use SMTP to send and receive mail messages, user-level client mail applications typically use SMTP only for sending messages to a mail server for relaying. For retrieving messages, client applications usually use either IMAP or POP3.
  19. SMTP communication between mail servers by default uses TCP port _____. Mail clients often use port _____ to submit emails to the mail service. Despite being deprecated, the nonstandard port _____ is commonly used by mail providers.
    SMTP communication between mail servers by default uses the TCP port 25. Mail clients often use port 587 to submit emails to the mail service. Despite being deprecated, the nonstandard port 465 is commonly used by mail providers.
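  20. SMTP connections can be secured with TLS (formerly SSL), either by _____.
    SMTP connections can be secured with TLS (formerly SSL), either by upgrading a plain connection with the STARTTLS command or by connecting with TLS from the start, a method known as SMTPS.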
  21. The urllib.request.urlopen() method _____.
    The urllib.request.urlopen() method opens the given URL. For HTTP and HTTPS URLs, it returns an http.client.HTTPResponse object that may be read like a file object.
  22. The xml.etree.ElementTree.fromstring() method _____.
    The xml.etree.ElementTree.fromstring() method parses XML from a string directly into an XML Element, which is the root element of the parsed tree.
  23. The xml.etree.ElementTree.iter() method _____.
    The xml.etree.ElementTree.iter() method returns an iterator that loops over all elements in the tree in document (depth-first) order, starting with the root element.
  24. The json.loads() method _____.
    The json.loads() method converts a given JSON string to a corresponding Python object (dict, list, string, etc.).
  25. The smtplib module _____.
    The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP or ESMTP listener daemon.

Assessments

See Also

References