An Application

We sketch an application.

A Virtual University

The running example in the book Making Use of Python by Rashi Gupta is the administration of Techsity University. The application rests on four pillars:

  1. web forms with CGI scripts,
  2. information management with databases,
  3. multiple servers to handle the load,
  4. multithreaded servers to handle many clients.

These topics correspond to the four weeks after the first midterm exam. In the fifth week after the first midterm exam, we studied combinations of those topics.

With CGI programming we build web interfaces to administer courses. The course administration system allows for browsing the course catalog, for answering queries about courses, and for handling the course registration. Students in online courses browse the course materials, download notes and slides, interact in online classes and labs, and upload their answers to assignments.

With database programming we administer the courses in three tables. The first table maintains information about students; the second one contains the course information (description, prerequisites, etc.); and the third table links students to courses, storing the enrollments. Every course has its own database for a detailed syllabus, for assignments, notes, and slides, and for the administration of the grades.
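
The three tables can be sketched as follows. This is a minimal sketch, not the schema of the book: it assumes the sqlite3 module of the standard library as the database engine, and all table names, column names, and the helper functions make_database and courses_of are hypothetical.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions.
SCHEMA = """
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE courses (
    course_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT,
    prerequisites TEXT
);
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id),
    course_id INTEGER REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);
"""

def make_database():
    """Returns a connection to an in-memory database with the three tables."""
    connection = sqlite3.connect(':memory:')
    connection.executescript(SCHEMA)
    return connection

def courses_of(connection, student_id):
    """Returns the sorted titles of the courses a student is enrolled in."""
    query = """SELECT courses.title FROM courses
               JOIN enrollments ON courses.course_id = enrollments.course_id
               WHERE enrollments.student_id = ?
               ORDER BY courses.title"""
    return [row[0] for row in connection.execute(query, (student_id,))]
```

The third table stores only pairs of keys, so a student or a course is described in exactly one place and an enrollment is one row.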

With network programming we maintain multiple servers to handle the load. The servers handle the online registration, manage the running of the courses, and back up essential data. We expect our servers to be robust and keep running during peak periods for registration, and during prime time for online courses.

With multithreaded programming, our servers are able to handle an indefinite number of requests, for an indefinite time. The distributed computing over multiple computers applies load balancing and rescheduling of requests.

Uploading Files

In our virtual university, answers to assignments will be uploaded via an HTML form. We must specify the encoding of the form by adding

enctype="multipart/form-data"

as one of the attributes of the form element. We then use an input element of type file. For example:

<input type="file" name="upfile" size="50">

The HTML code in the file uploadfile.html is listed below.

<html>
<head>
<title> MCS 275 Lec 36: uploading a file </title>
</head>
<body>
<h1> form to upload a file </h1>
<form method="post"
      action="http://localhost/cgi-bin/uploadfile.py"
      enctype="multipart/form-data">
<input type="file" name="upfile" size = "50">
<p> <input type="submit" value="submit your file">
    <input type="reset" value="cancel selection"> </p>
</form>
</body>
</html>

The uploaded file is processed with a CGI script. The name attribute of the input element has the value 'upfile'. In the CGI script, we get the form via

form = cgi.FieldStorage()

We use the value of the name attribute, as defined in the form, as the key to access the file:

uploaded = form['upfile']

The file attribute of uploaded is used for reading:

line = uploaded.file.readline()

The script uploadfile.py, which prints the first line of the file, is listed below.

#!/usr/bin/python
"""
This CGI script takes the input of the form
uploadfile.html and writes the first line of
the file in plaintext on the web page.
"""
import cgi
FORM = cgi.FieldStorage()

print("Content-Type: text/plain\n")

UPLOADED = FORM['upfile']
LINE = UPLOADED.file.readline()
print(LINE)
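
In Python 3 the uploaded file is typically read in binary mode, in which case readline() returns a bytes object and print(LINE) displays it as b'...'. A minimal sketch of a conversion to text is below. The function first_line_text is hypothetical and the UTF-8 encoding of the upload is an assumption.

```python
import io

def first_line_text(fileobj, encoding='utf-8'):
    """
    Returns the first line of a binary file object as a string,
    without the trailing newline. The encoding is an assumption;
    bytes that cannot be decoded are replaced.
    """
    raw = fileobj.readline()
    return raw.decode(encoding, errors='replace').rstrip('\r\n')

# usage with an in-memory file standing in for the uploaded file
SAMPLE = io.BytesIO(b'hello world\nsecond line\n')
print(first_line_text(SAMPLE))
```

In the script, the call would take UPLOADED.file as its argument.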

The form is defined by uploadfile.html and uploadfile.py defines the action. Apache is needed to test this approach. Combining the printing of the form with the processing of its input into one script works with myserver.py.

Steps in the writing of the combined solution:

  1. Copy the HTML code in uploadfile.html into uploadform.py and define a function print_form. Change the action of the form to uploadform.py.
  2. Add a main() and a print_header to uploadform.py and test with myserver.py.
  3. Copy the code of uploadfile.py as a function in uploadform.py.
  4. Update the main() with an if else statement.

The printing of the form is defined below.

def print_form():
    """
    Prints the form to upload a file.
    """
    print("""
<html>
<body>

<h1> form to upload a file </h1>

<form method="post"
      action="uploadform.py"
      enctype="multipart/form-data">

<input type="file" name="upfile" size = "50">

<p> <input type="submit" value="submit your file">
    <input type="reset" value="cancel selection"> </p>
</form>

</body>
</html>""")

Code for the processing of the file is defined in the function show_firstline.

def show_firstline(form):
    """
    Prints the first line of the uploaded file
    for display in the web browser.
    """
    uploaded = form['upfile']
    line = uploaded.file.readline()
    print("""
<html>
<body>

<h1> first line of the uploaded file </h1>

%s
</body>
</html>
""" % line)

Code for the function main() is listed below.

def main():
    """
    Prints the form or processes the answer.
    """
    print_header("uploading a file")
    form = cgi.FieldStorage()
    if 'upfile' not in form:
        print_form()
    else:
        show_firstline(form)
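
The function print_header is called in main() but not listed here. A minimal sketch is below, under the assumption that it only has to write the content type, followed by the blank line that ends the HTTP header. What the original does with the title is not shown, so writing the title as an HTML comment is a guess.

```python
def print_header(title):
    """
    Hypothetical version of print_header: writes the content type,
    a blank line to end the header, and the title as an HTML comment.
    """
    print("Content-Type: text/html\n")
    print("<!-- %s -->" % title)
```

Because print_form and show_firstline each write their own html and body tags, this sketch writes no opening tags itself.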

Course Listings at UIC

One quick way to create a database of considerable size is to load it with data available on web pages.

As an application, we build a database of courses at UIC. The data is already structured and readily available on UIC’s web pages. The stages in this project are:

  1. grab the data from the web into a file;
  2. format the data on file into data tuples; and
  3. insert data tuples into a database.

Running the script courselistings.py produces the output as listed below.

$ python courselistings.py
Give a subject : lat
Give the year (4 digits) : 2016
Spring, Fall, or Summer (0, 1, or 2) : 0
opening http://osss.uic.edu/ims/classschedule/S2016/LAT.htm ...
courses on http://osss.uic.edu/ims/classschedule/S2016/LAT.htm :
LAT102 Elementary Latin II
LAT104 Intermediate Latin II
LAT299 Independent Reading
$
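
Stages 2 and 3 can be sketched on lines such as those printed above. This is a minimal sketch, not the code of courselistings.py: it assumes that each line starts with the subject and the course number glued together, it uses sqlite3 as the database engine, and the names course_tuple, store_courses, and the table courses are hypothetical.

```python
import sqlite3

def course_tuple(line):
    """
    Splits a line like 'LAT102 Elementary Latin II' into
    the tuple ('LAT', 102, 'Elementary Latin II').
    """
    code, title = line.split(' ', 1)
    subject = code.rstrip('0123456789')   # letters before the number
    number = int(code[len(subject):])     # digits after the subject
    return (subject, number, title.strip())

def store_courses(lines, connection):
    """
    Inserts the data tuples of all lines into the table courses.
    """
    connection.execute(
        'CREATE TABLE IF NOT EXISTS courses '
        '(subject TEXT, number INTEGER, title TEXT)')
    connection.executemany(
        'INSERT INTO courses VALUES (?, ?, ?)',
        [course_tuple(line) for line in lines])
    connection.commit()
```

The question marks in the insert statement let the database module do the quoting of the titles.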

Course archives at UIC are available via the web site for the office of student services (osss). The base of the URL is

http://osss.uic.edu/ims/classschedule/

followed by S, SUM, or F for the spring, summer, or fall semester, followed by the year as a 4-digit number, e.g. 2010, then a slash, the subject in upper case, e.g. LAT, and the extension .htm, as in the URL printed in the session above.
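
The construction of the URL can be sketched as below. The function schedule_url is hypothetical; the codes 0, 1, and 2 for spring, fall, and summer follow the prompt in the session above.

```python
BASE = 'http://osss.uic.edu/ims/classschedule/'

# codes as in the prompt: Spring, Fall, or Summer (0, 1, or 2)
SEMESTERS = {0: 'S', 1: 'F', 2: 'SUM'}

def schedule_url(subject, year, semester):
    """
    Returns the URL of the class schedule for the subject,
    the year (an integer), and the semester (0, 1, or 2).
    """
    return '%s%s%d/%s.htm' % (BASE, SEMESTERS[semester],
                              year, subject.upper())

print(schedule_url('lat', 2016, 0))
```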

The html.parser module helps to parse HTML code. It is available in the standard Python distribution:

>>> from html.parser import HTMLParser
>>> help(HTMLParser)

The class HTMLParser

  • allows us to override the handlers of tags,
  • provides a feed method which handles the buffering.

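A raw template for using the class HTMLParser is outlined below.

from html.parser import HTMLParser
from urllib.request import urlopen

class OurHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)   # initialize the parent class

    def handle_starttag(self, tag, attrs):
        pass   # called for every opening tag

    def handle_endtag(self, tag):
        pass   # called for every closing tag

def main():
    page = 'http://www.uic.edu'   # the URL of the page to parse
    f = urlopen(page)
    p = OurHTMLParser()
    while True:
        data = f.read(80).decode()   # read returns bytes
        if data == '':
            break
        p.feed(data)
    p.close()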

Our first application of this class is to gather basic statistics about a page: we list the types of tags on the page and count the number of occurrences for each tag. At the end of each tag the tally is updated.

The natural data structure for the tally is of course a dictionary:

  • keys: string with the type of tag,
  • values: natural number that counts the number of occurrences.

The tally is an object data attribute.

In the definition of our class TagTally we override the methods in HTMLParser. The documentation strings are listed below.

from html.parser import HTMLParser
from urllib.request import urlopen

class TagTally(HTMLParser):
    """
    Makes a tally of ending tags.
    """
    def __init__(self):
        """
        Initializes the dictionary of tags.
        """
    def __str__(self):
        """
        Returns the string representation of tags.
        """
    def handle_endtag(self, tag):
        """
        Maintains a tally of the tags.
        """

The main function in the script tallytags.py (which contains the definition of the class TagTally) is listed below.

def main():
    """
    Opens a web page and parses it.
    """
    url = 'http://www.uic.edu'
    print('opening %s ...' % url)
    page = urlopen(url)
    tags = TagTally()
    while True:
        data = page.read(80).decode()
        if data == '':
            break
        tags.feed(data)
    tags.close()
    print('the tally of tags :')
    print(tags)
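
Decoding every block of 80 bytes separately raises a UnicodeDecodeError when a character of several bytes is cut in two at the end of a block. A minimal sketch of a remedy is below, using an incremental decoder of the codecs module. The function decoded_chunks is hypothetical and the UTF-8 encoding of the page is an assumption.

```python
import codecs
import io

def decoded_chunks(stream, size=80, encoding='utf-8'):
    """
    Reads size bytes at a time from the binary stream and yields
    strings. Bytes of an incomplete character are kept by the
    decoder until the next block arrives.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        data = stream.read(size)
        if not data:
            break
        text = decoder.decode(data)
        if text:
            yield text
    tail = decoder.decode(b'', final=True)   # flush what is left
    if tail:
        yield tail

# usage with an in-memory stream standing in for the opened page
PAGE = io.BytesIO('<p>caf\u00e9</p>'.encode('utf-8'))
print(''.join(decoded_chunks(PAGE, size=3)))
```

In main(), the loop would become: for data in decoded_chunks(page): tags.feed(data).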

In the constructor of the class TagTally, we initialize the parent class.

def __init__(self):
    """
    Initializes the dictionary of tags.
    """
    HTMLParser.__init__(self)
    self.tags = {}

The data attribute tags is the dictionary with all tags on the web page. What is printed is defined by the string representation below.

def __str__(self):
    """
    Returns the string representation of tags.
    """
    result = ''
    for tag in self.tags:
        result += str(tag) + ':' + str(self.tags[tag]) + '\n'
    return result[:-1]

In the update of the tally, we first check whether the tag is already in the dictionary.

def handle_endtag(self, tag):
    """
    Maintains a tally of the tags.
    """
    if tag in self.tags:
        self.tags[tag] = self.tags[tag] + 1
    else:
        self.tags.update({tag: 1})

If the tag is not yet present, then we add it to the dictionary with count 1. The handle_endtag above contains the same instructions as covered before in the construction of a frequency table.
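
As a variation, the tally can be kept in a Counter of the collections module, which starts every count at zero so the test on the key disappears. This is a minimal sketch: the class CounterTally and the function tally_string are hypothetical, and the parser is fed a string instead of a web page so it runs without a network connection.

```python
from collections import Counter
from html.parser import HTMLParser

class CounterTally(HTMLParser):
    """
    Makes a tally of ending tags with a Counter.
    """
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = Counter()

    def handle_endtag(self, tag):
        self.tags[tag] += 1   # missing keys count as zero

def tally_string(text):
    """
    Returns the tally of the ending tags in text as a dictionary.
    """
    parser = CounterTally()
    parser.feed(text)
    parser.close()
    return dict(parser.tags)

print(tally_string('<p>one</p><p>two</p><b>three</b>'))
```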

In a web crawler, we want to get the links the page refers to. In our previous code we searched for double quoted strings which started with http. A more proper way to get the hyperlinks proceeds as follows:

  1. look for tags of type 'a',
  2. where the name of the attribute is href; and then
  3. get the hyperlink corresponding to href.

The documentation strings of the class HTMLrefs are listed below.

from html.parser import HTMLParser
from urllib.request import urlopen

class HTMLrefs(HTMLParser):
    """
    Makes a list of all html links.
    """
    def __init__(self):
        """
        Initializes the list of links.
        """
    def __str__(self):
        """
        Returns the string rep of the links.
        """
    def handle_starttag(self, tag, attrs):
        """
        Looks for tags equal to 'a' and
        stores links for href attributes.
        """

The definition of the class HTMLrefs is contained in the script htmlrefs.py, where the main function is listed below.

def main():
    """
    Opens a web page and parses it.
    """
    url = 'http://www.uic.edu/'
    print('opening %s ...' % url)
    page = urlopen(url)
    refs = HTMLrefs()
    while True:
        data = page.read(80).decode()
        if data == '':
            break
        refs.feed(data)
    refs.close()
    print('all html links :')
    print(refs)

We use a list as the object data attribute to store the links.

def __init__(self):
    """
    Initializes the list of links.
    """
    HTMLParser.__init__(self)
    self.refs = []

The printing of the list of links is defined by the string representation in the class HTMLrefs.

def __str__(self):
    """
    Returns the string rep of the links.
    """
    result = ''
    for link in self.refs:
        result += link + '\n'
    return result[:-1]

Now we still have to filter the attributes. Attributes are lists of tuples, for example:

[('href', 'learning.shtml'), ...]

The link is the y in the tuple (x, y).

def handle_starttag(self, tag, attrs):
    """
    Looks for tags equal to 'a' and
    stores links for href attributes.
    """
    print(attrs)
    if tag == 'a':
        F = [x_y for x_y in attrs if x_y[0] == 'href']
        L = [y for (x, y) in F]
        self.refs = self.refs + L
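
Links such as 'learning.shtml' are relative to the page they appear on. A minimal sketch of a variant that stores absolute links is below, using urljoin of urllib.parse. The class AbsoluteRefs and the function links_in are hypothetical, and the parser is fed a string so the example runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AbsoluteRefs(HTMLParser):
    """
    Makes a list of all html links, resolved against a base URL.
    """
    def __init__(self, base):
        HTMLParser.__init__(self)
        self.base = base
        self.refs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (name, value) in attrs:
                # the value is None for an attribute without a value
                if name == 'href' and value:
                    self.refs.append(urljoin(self.base, value))

def links_in(text, base):
    """
    Returns the list of absolute links in the string text.
    """
    parser = AbsoluteRefs(base)
    parser.feed(text)
    parser.close()
    return parser.refs

print(links_in('<a href="learning.shtml">learning</a>',
               'http://www.uic.edu/'))
```

Absolute links are what a web crawler needs in order to open the pages a page refers to.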

Exercises

  1. Write a script that prompts the user for a URL and that finds the number of forms on the web page. The script should not crash when the page fails to open, but it should then display an error message.
  2. Write a script to look for files with the extension .py.
  3. Consider webcrawler.py of lecture 34. Use HTMLParser to write a shorter version.
  4. Make a class BoldText so all text formatted in bold is stored in a list in an object data attribute.