An Application¶
We sketch an application.
A Virtual University¶
The running example in the book Making Use of Python by Rashi Gupta is the administration of Techsity University.
The application rests on four pillars:
- web forms with CGI scripts,
- information management with databases,
- multiple servers to handle the load,
- multithreaded servers to handle many clients.
These topics correspond to the four weeks after the first midterm exam. In the fifth week after the first midterm exam, we studied combinations of those topics.
With CGI programming we build web interfaces to administer courses. The course administration system allows for browsing the course catalog, for answering queries about courses, and for handling the course registration. Students in online courses browse the course materials, download notes and slides, interact in online classes and labs, and upload their answers to assignments.
With database programming we administer the courses in three tables. The first table maintains information about students; the second one contains the course information (description, prerequisites, etc.); and the third table links students to courses, storing the enrollments. Every course has its own database, for a detailed syllabus; assignments, notes and slides; and for the administration of the grades.
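As a minimal sketch of the three tables, the code below uses the sqlite3 module of the standard library with an in-memory database. The table names, the column names, and the sample rows are assumptions made for illustration; the actual application may use a different database system and schema.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
db = sqlite3.connect(":memory:")
cur = db.cursor()

# Table 1: students. Table 2: courses. Table 3: enrollments link the two.
cur.execute("CREATE TABLE students (sid INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE courses (
    cid INTEGER PRIMARY KEY, title TEXT, prerequisites TEXT)""")
cur.execute("""CREATE TABLE enrollments (
    sid INTEGER REFERENCES students(sid),
    cid INTEGER REFERENCES courses(cid))""")

# Hypothetical sample data.
cur.execute("INSERT INTO students VALUES (1, 'Ada')")
cur.execute("INSERT INTO students VALUES (2, 'Bob')")
cur.execute("INSERT INTO courses VALUES (275, 'Programming Tools', 'MCS 260')")
cur.executemany("INSERT INTO enrollments VALUES (?, ?)", [(1, 275), (2, 275)])
db.commit()

# Query the third table: who is enrolled in course 275?
cur.execute("""SELECT students.name FROM students
    JOIN enrollments ON students.sid = enrollments.sid
    WHERE enrollments.cid = ? ORDER BY students.name""", (275,))
ROSTER = [row[0] for row in cur.fetchall()]
print(ROSTER)
```

The enrollments table stores only pairs of keys, so a student or a course is described in exactly one place.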
With network programming we maintain multiple servers to handle the load. The servers handle the online registration, manage the running of the courses, and back up essential data. We expect our servers to be robust and keep running during peak periods for registration, and during prime time for online courses.
With multithreaded programming, our servers are able to handle an indefinite number of requests, for an indefinite time. The distributed computing over multiple computers applies load balancing and rescheduling of requests.
Uploading Files¶
In our virtual university, answers to assignments will be uploaded via an HTML form. We must specify the encoding of the form by adding
enctype="multipart/form-data"
as one of the attributes of the form tag.
We then use an input element of type file.
For example:
<input type="file" name="upfile" size="50">
The HTML code in the file uploadfile.html is listed below.
<html>
<head>
<title> MCS 275 Lec 36: uploading a file </title>
</head>
<body>
<h1> form to upload a file </h1>
<form method="post"
action="http://localhost/cgi-bin/uploadfile.py"
enctype="multipart/form-data">
<input type="file" name="upfile" size = "50">
<p> <input type="submit" value="submit your file">
<input type="reset" value="cancel selection"> </p>
</form>
</body>
</html>
The uploaded file is processed with a CGI script. The name field of the input element has value 'upfile'.
In the CGI script, we get the form via
form = cgi.FieldStorage()
We use the value of the name field, as defined in the form, as the key to access the file:
uploaded = form['upfile']
We then use the file attribute of uploaded for reading:
line = uploaded.file.readline()
The script uploadfile.py, which prints the first line of the uploaded file, is listed below.
#!/usr/bin/python
"""
This CGI script takes the input of the form
uploadfile.html and writes the first line of
the file in plaintext on the web page.
"""
import cgi
FORM = cgi.FieldStorage()
print("Content-Type: text/plain\n")
UPLOADED = FORM['upfile']
LINE = UPLOADED.file.readline()
print(LINE)
The form is defined by uploadfile.html and uploadfile.py defines the action. Apache is needed to test this approach.
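To see what the multipart/form-data encoding puts in the body of the request, without setting up a web server, the sketch below builds such a body by hand and parses it with the email package of the standard library. This is not the cgi module used in the script above; the boundary string, the file name answer.txt, and the file content are made up for illustration.

```python
from email.parser import BytesParser
from email.policy import default

BOUNDARY = "sketchboundary"  # hypothetical; a browser picks its own

# What a browser might send for a file answer.txt with two lines.
BODY = (
    "--" + BOUNDARY + "\r\n"
    'Content-Disposition: form-data; name="upfile"; filename="answer.txt"\r\n'
    "Content-Type: text/plain\r\n"
    "\r\n"
    "first line\nsecond line\n"
    "\r\n--" + BOUNDARY + "--\r\n"
).encode()

# The parser also needs the Content-Type header of the request.
HEADER = ("Content-Type: multipart/form-data; boundary="
          + BOUNDARY + "\r\n\r\n").encode()
MESSAGE = BytesParser(policy=default).parsebytes(HEADER + BODY)

# The form has one input element, so there is one part.
PART = next(MESSAGE.iter_parts())
NAME = PART.get_param("name", header="content-disposition")
FIRST_LINE = PART.get_payload(decode=True).splitlines()[0]
print(NAME, FIRST_LINE)
```

The name of the part is the value of the name field in the form, and the content of the file arrives as bytes.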
Combining the printing of the form with the processing of its input into one script works with myserver.py.
Steps in the writing of the combined solution:
- Copy the HTML code in uploadfile.html into uploadform.py and define a function print_form. Change the action of the form to uploadform.py.
- Add a main() and a print_header to uploadform.py and test with myserver.py.
- Copy the code of uploadfile.py as a function in uploadform.py.
- Update the main() with an if else statement.
The printing of the form is defined below.
def print_form():
    """
    Prints the form to upload a file.
    """
    print("""
<html>
<body>
<h1> form to upload a file </h1>
<form method="post"
action="uploadform.py"
enctype="multipart/form-data">
<input type="file" name="upfile" size = "50">
<p> <input type="submit" value="submit your file">
<input type="reset" value="cancel selection"> </p>
</form>
</body>
</html>""")
Code for the processing of the file is defined in the function show_firstline.
def show_firstline(form):
    """
    Prints the first line of the uploaded file
    for display in the web browser.
    """
    uploaded = form['upfile']
    line = uploaded.file.readline()
    print("""
<html>
<body>
<h1> first line of the uploaded file </h1>
%s
</body>
</html>
""" % line)
Code for the function main() is listed below.
def main():
    """
    Prints the form or processes the answer.
    """
    print_header("uploading a file")
    form = cgi.FieldStorage()
    if 'upfile' not in form:
        print_form()
    else:
        show_firstline(form)
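The function print_header called in main() is not listed here. The sketch below is one guess at what it could look like, under the assumption that it writes the content type, the blank line that ends the HTTP header, and the title; the actual function in uploadform.py may differ.

```python
import io
from contextlib import redirect_stdout

def print_header(title):
    """
    Hypothetical helper, not the original code:
    writes the content type, a blank line, and the title.
    """
    print("Content-Type: text/html\n")
    print("<title> %s </title>" % title)

# Capture what the script would send to the browser.
BUFFER = io.StringIO()
with redirect_stdout(BUFFER):
    print_header("uploading a file")
OUTPUT = BUFFER.getvalue()
print(OUTPUT)
```

The content type is text/html and no longer text/plain, because the combined script always answers with a web page.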
Course Listings at UIC¶
One quick way to create a database of considerable size is to load it with data available on web pages.
As an application, we build a database of courses at UIC, because the data is already structured and readily available on UIC’s web pages. The stages in this project are:
- grab the data from the web into a file;
- format the data on file into data tuples; and
- insert data tuples into a database.
Running the script courselistings.py produces the output as listed below.
$ python courselistings.py
Give a subject : lat
Give the year (4 digits) : 2016
Spring, Fall, or Summer (0, 1, or 2) : 0
opening http://osss.uic.edu/ims/classschedule/S2016/LAT.htm ...
courses on http://osss.uic.edu/ims/classschedule/S2016/LAT.htm :
LAT102 Elementary Latin II
LAT104 Intermediate Latin II
LAT299 Independent Reading
$
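The second and third stage can be sketched on the three lines of the output above. The function to_tuple, the table layout, and the use of sqlite3 with an in-memory database are assumptions for illustration, not the code of courselistings.py.

```python
import sqlite3

# Lines as printed for the subject LAT in the spring of 2016.
LINES = ["LAT102 Elementary Latin II",
         "LAT104 Intermediate Latin II",
         "LAT299 Independent Reading"]

def to_tuple(line):
    """
    Splits a line like 'LAT102 Elementary Latin II' into
    the tuple ('LAT', 102, 'Elementary Latin II').
    """
    code, title = line.split(" ", 1)
    subject = code.rstrip("0123456789")   # letters before the number
    number = int(code[len(subject):])
    return (subject, number, title)

# Stage 2: format the data into tuples.
DATA = [to_tuple(line) for line in LINES]

# Stage 3: insert the tuples into a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE courses (subject TEXT, number INTEGER, title TEXT)")
db.executemany("INSERT INTO courses VALUES (?, ?, ?)", DATA)
db.commit()

COUNT = db.execute("SELECT COUNT(*) FROM courses").fetchone()[0]
print(COUNT, "courses stored")
```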
Course archives at UIC are available via the web site for the office of student services (osss). The base of the URL is
http://osss.uic.edu/ims/classschedule/
followed by S, SUM, or F for the spring, summer, or fall semester, followed by the year as a 4-digit number, e.g. 2010, followed by the subject, e.g. LAT.
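A small sketch of how such a URL can be assembled. The function name course_url is made up, and the '/' before the subject and the extension .htm are inferred from the URL in the sample output above.

```python
BASE = "http://osss.uic.edu/ims/classschedule/"

# Indices follow the prompt of the script: 0 spring, 1 fall, 2 summer.
SEMESTERS = ["S", "F", "SUM"]

def course_url(subject, year, semester):
    """
    Returns the URL of the listing for the subject,
    in the given year and semester (0, 1, or 2).
    """
    return "%s%s%d/%s.htm" % (BASE, SEMESTERS[semester],
                              year, subject.upper())

URL = course_url("lat", 2016, 0)
print(URL)
```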
The class HTMLParser of the module html.parser helps to parse HTML code.
It is available in the standard Python distribution:
>>> from html.parser import HTMLParser
>>> help(HTMLParser)
The class HTMLParser
- allows us to override the handlers of tags,
- provides a feed method which handles the buffering.
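The buffering done by feed can be observed in a small experiment: the same text is given to the parser once as a whole and once in chunks of five characters, which cut tags in the middle. The class EndTags and the sample string are made up for this sketch.

```python
from html.parser import HTMLParser

class EndTags(HTMLParser):
    """
    Records every closing tag in the order of appearance.
    """
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_endtag(self, tag):
        self.seen.append(tag)

def end_tags(text, size):
    """
    Feeds text to a new parser in chunks of the given size
    and returns the list of closing tags.
    """
    parser = EndTags()
    for start in range(0, len(text), size):
        parser.feed(text[start:start + size])
    parser.close()
    return parser.seen

SAMPLE = "<html><body><p>one</p><p>two</p></body></html>"
WHOLE = end_tags(SAMPLE, len(SAMPLE))
CHUNKED = end_tags(SAMPLE, 5)
print(WHOLE)
print(CHUNKED == WHOLE)
```

An incomplete tag at the end of a chunk is kept in the buffer until the next call to feed completes it.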
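A rough template for using the class HTMLParser is outlined below.
from html.parser import HTMLParser
from urllib.request import urlopen

class OurHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # initialize the parent class
    def handle_starttag(self, tag, attrs):
        pass  # override to process opening tags
    def handle_endtag(self, tag):
        pass  # override to process closing tags

def main():
    f = urlopen(page)  # page is the URL, still to be defined
    p = OurHTMLParser()
    while True:
        data = f.read(80).decode()  # bytes to string
        if data == '':
            break
        p.feed(data)
    p.close()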
Our first application of this class is to gather basic statistics about a page: we list the types of tags on the page and count the number of occurrences for each tag. At the end of each tag the tally is updated.
The natural data structure for the tally is of course a dictionary:
- keys: string with the type of tag,
- values: natural number that counts the number of occurrences.
The tally is an object data attribute.
In the definition of our class TagTally we override the methods in HTMLParser.
The documentation strings are listed below.
from html.parser import HTMLParser
from urllib.request import urlopen

class TagTally(HTMLParser):
    """
    Makes a tally of ending tags.
    """
    def __init__(self):
        """
        Initializes the dictionary of tags.
        """
    def __str__(self):
        """
        Returns the string representation of tags.
        """
    def handle_endtag(self, tag):
        """
        Maintains a tally of the tags.
        """
The main function in the script tallytags.py (which contains the definition of the class TagTally) is listed below.
def main():
    """
    Opens a web page and parses it.
    """
    url = 'http://www.uic.edu'
    print('opening %s ...' % url)
    page = urlopen(url)
    tags = TagTally()
    while True:
        data = page.read(80).decode()
        if data == '':
            break
        tags.feed(data)
    tags.close()
    print('the tally of tags :')
    print(tags)
In the constructor of the class TagTally, we first initialize the parent class.
def __init__(self):
    """
    Initializes the dictionary of tags.
    """
    HTMLParser.__init__(self)
    self.tags = {}
The data attribute tags is the dictionary with all tags on the web page. What is printed is defined by the string representation below.
def __str__(self):
    """
    Returns the string representation of tags.
    """
    result = ''
    for tag in self.tags:
        result += str(tag) + ':' + str(self.tags[tag]) + '\n'
    return result[:-1]
In the update of the tally, we first check whether the tag is already a key in the dictionary.
def handle_endtag(self, tag):
    """
    Maintains a tally of the tags.
    """
    if tag in self.tags:
        self.tags[tag] = self.tags[tag] + 1
    else:
        self.tags.update({tag: 1})
If the tag is not yet present, then we add it to the dictionary with a count of one.
The handle_endtag above contains the same instructions as covered before in the construction of a frequency table.
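The update of the tally can be tried without network access by feeding a string to the parser. The sketch below is a compact variant, not the code of tallytags.py: it uses the get method of the dictionary in place of the if else statement, and the sample string is made up.

```python
from html.parser import HTMLParser

class TagTally(HTMLParser):
    """
    Makes a tally of ending tags.
    """
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_endtag(self, tag):
        # get returns 0 for a tag that is not yet a key
        self.tags[tag] = self.tags.get(tag, 0) + 1

SAMPLE = "<ul><li>a</li><li>b</li><li>c</li></ul>"
TALLY = TagTally()
TALLY.feed(SAMPLE)
TALLY.close()
print(TALLY.tags)
```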
In a web crawler, we want to get the links the page refers to.
In our previous code we searched for double quoted strings which started with http.
A more proper way to get the hyperlinks proceeds as follows:
- look for tags of type 'a',
- where the name of the attribute is href; and then
- get the hyperlink corresponding to href.
The documentation strings of the class HTMLrefs are listed below.
from html.parser import HTMLParser
from urllib.request import urlopen

class HTMLrefs(HTMLParser):
    """
    Makes a list of all html links.
    """
    def __init__(self):
        """
        Initializes the list of links.
        """
    def __str__(self):
        """
        Returns the string rep of the links.
        """
    def handle_starttag(self, tag, attrs):
        """
        Looks for tags equal to 'a' and
        stores links for href attributes.
        """
The definition of the class HTMLrefs is contained in the script htmlrefs.py, where the main function is listed below.
def main():
    """
    Opens a web page and parses it.
    """
    url = 'http://www.uic.edu/'
    print('opening %s ...' % url)
    page = urlopen(url)
    refs = HTMLrefs()
    while True:
        data = page.read(80).decode()
        if data == '':
            break
        refs.feed(data)
    refs.close()
    print('all html links :')
    print(refs)
We use a list as the object data attribute to store the links.
def __init__(self):
    """
    Initializes the list of links.
    """
    HTMLParser.__init__(self)
    self.refs = []
The printing of the list of links is defined by the string representation in the class HTMLrefs.
def __str__(self):
    """
    Returns the string rep of the links.
    """
    result = ''
    for link in self.refs:
        result += link + '\n'
    return result[:-1]
Now we still have to filter the attributes. Attributes are lists of tuples, for example:
[('href', 'learning.shtml'), ...]
The link is the y in the tuple (x, y).
def handle_starttag(self, tag, attrs):
    """
    Looks for tags equal to 'a' and
    stores links for href attributes.
    """
    print(attrs)
    if tag == 'a':
        F = [x_y for x_y in attrs if x_y[0] == 'href']
        L = [y for (x, y) in F]
        self.refs = self.refs + L
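The filtering can be checked on a string, without opening a web page. The sketch below is a condensed variant of the class, with a made-up sample string. As an addition that is not in htmlrefs.py, relative links such as learning.shtml are turned into full URLs with urljoin of urllib.parse, which a web crawler would need before it can follow them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HTMLrefs(HTMLParser):
    """
    Makes a list of all html links.
    """
    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # keep the y of every tuple (x, y) where x is 'href'
            self.refs += [y for (x, y) in attrs if x == 'href']

BASE = 'http://www.uic.edu/'
SAMPLE = ('<p><a href="learning.shtml">learn</a> '
          '<a name="top">no link here</a> '
          '<a class="x" href="http://example.com/">other</a></p>')

PARSER = HTMLrefs()
PARSER.feed(SAMPLE)
PARSER.close()

# Resolve relative links against the address of the page.
LINKS = [urljoin(BASE, ref) for ref in PARSER.refs]
print(LINKS)
```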
Exercises¶
- Write a script that prompts the user for a URL and that finds the number of forms on the web page. The script should not crash when the page fails to open; it should then display an error message instead.
- Write a script to look for files with the extension .py.
- Consider webcrawler.py of lecture 34. Use HTMLParser to write a shorter version.
- Make a class BoldText so all text formatted in bold is stored in a list in an object data attribute.