Processing Data from the Web¶
As an application we consider the creation of mysql tables with schedule information of the Chicago Transit Authority (CTA).
CTA Tables¶
We can download the schedules of the CTA, via
http://www.transitchicago.com/developers/gtfs.aspx
.
The GTFS (General Transit Feed Specification)
is an open format for packaging scheduled service data.
A GTFS feed is a series of text files with data
on lines separated by commas (csv format).
Each file is a table in a relational database.
Some tables in the feed are the following:
stops.txt
: stop locations for bus or trainroutes.txt
: route list with unique identifierstrips.txt
: information about each trip by a vehiclestop_times.txt
: scheduled arrival and departure times for each stop on each trip.
Before we start with mysql, we write scripts to search
the files. The first script will prompt the user for a number
and then look whether the number corresponds to a stop id
in the file stops.txt
. If there is a correspondence,
then the name corresponding with the stop id
will be printed.
A run with the script is below.
$ python3 ctastopname.py
opening CTA/stops.txt ...
give a stop id : 3021
skipping line 0
3021 has name "California & Augusta"
The script looks for the line
3021,3021,"California & Augusta",41.89939053,-87.69688045,0,,1
In a Terminal window, we can type
$ cat stops.txt | grep ",3021,"
3021,3021,"California & Augusta",41.89939053,-87.69688045,0,,1
$
The file stops.txt
is saved in the subdirectory CTA
of the current directory where the script ctastopname.py
is.
The code for ctastopname.py
is listed next.
FILENAME = 'CTA/stops.txt'
print('opening', FILENAME, '...')
DATAFILE = open(FILENAME, 'r')
STOPID = int(input('give a stop id : '))
COUNT = 0
STOPNAME = None
while True:
LINE = DATAFILE.readline()
if LINE == '':
break
L = LINE.split(',')
try:
if int(L[0]) == STOPID:
STOPNAME = L[2]
break
except:
print('skipping line', COUNT)
COUNT = COUNT + 1
print(STOPID, 'has name', STOPNAME)
The other data file we consider is stop_times.txt
.
For a given stop, we search through that file for the trips
the pass at this stop.
The script ctastoptimes.py
prompts the user for
an identification of a stop and then prints
all CTA vehicles that stop there.
$ python3 ctastoptimes.py
opening CTA/stop_times.txt ...
give a stop id : 3021
skipping line 0
adding "63rd Pl/Kedzie"
adding "Jackson"
['"63rd Pl/Kedzie"', '"Jackson"']
We scan the lines in stop_times.txt
for
where the given stop identification occurs.
Code for ctastoptimes.py
is listed below.
FILENAME = 'CTA/stop_times.txt'
print('opening', FILENAME, '...')
DATAFILE = open(FILENAME, 'r')
STOPID = int(input('give a stop id : '))
COUNT = 0
TIMES = []
while True:
LINE = DATAFILE.readline()
if LINE == '':
break
L = LINE.split(',')
try:
if int(L[3]) == STOPID:
if L[5] not in TIMES:
print('adding', L[5])
TIMES.append(L[5])
except:
print('skipping line', COUNT)
COUNT = COUNT + 1
print(TIMES)
CTA Tables in MySQL¶
Searching through the files works for an one time lookup,
but is inefficient for frequent consultation.
A GTFS feed is a series of text files with data
on lines separated by commas (csv format).
Each file is a table in a relational database.
We call our database CTA
and will add tables
reading the information from stops.txt
.
The first line in stops.txt
lists nine fields.
For each field, we select an appropriate mysql
type,
as enumerated below:
stop_id
: typeINT
stop_code
: typeINT
stop_name
: typeCHAR(80)
stop_desc
: typeVARCHAR(128)
stop_lat
: typeFLOAT
stop_lon
: typeFLOAT
location_type
: typeINT
parent_station
: typeINT
wheelchair_boarding
: typeSMALLINT
Depending on the setup of mysql
, we may have
to execute mysql
as superuser (or sudo
).
To make a database CTA
, we run mysqladmin
at the command prompt.
$ mysqladmin create CTA
Then we start mysql
, to create a table in the database:
mysql> use CTA;
Database changed
mysql> create table stops
-> (id INT, code INT, name CHAR(80),
-> ndesc VARCHAR(128),
-> lat FLOAT, lon FLOAT,
-> tp INT, ps INT, wb SMALLINT);
Query OK, 0 rows affected (0.01 sec)
To double check if the creation of the table stops
was a success, we check as follows.
mysql> explain stops;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| code | int(11) | YES | | NULL | |
| name | char(80) | YES | | NULL | |
| ndesc | varchar(128) | YES | | NULL | |
| lat | float | YES | | NULL | |
| lon | float | YES | | NULL | |
| tp | int(11) | YES | | NULL | |
| ps | int(11) | YES | | NULL | |
| wb | smallint(6) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
9 rows in set (0.00 sec)
The first data line in stops.txt
contains
1,1,"Jackson & Austin Terminal","Jackson & Austin Terminal, Northeastbound, Bus Terminal",41.87632184,-87.77410482,0,,1
We can manually insert this data line as
mysql> insert into stops values
-> (1,1,"Jackson & Austin Terminal",
-> "Jackson & Austin Terminal, Northeastbound, Bus Terminal",
-> 41.87632184,-87.77410482,0,0,1);
Query OK, 1 row affected (0.00 sec)
To check the above insert, we do a simple query.
mysql> select name from stops where id = 1;
+---------------------------+
| name |
+---------------------------+
| Jackson & Austin Terminal |
+---------------------------+
1 row in set (0.00 sec)
To delete a row, given its id
we execute the query:
mysql> delete from stops where id = 1;
Query OK, 1 row affected (0.65 sec)
mysql> select * from stops;
Empty set (0.01 sec)
If the where
clause is omitted,
then all rows in the table are deleted.
Typing in 12,162 entries is rather tedious.
Therefore, we will fill the table stops
with a Python script.
After filling the table stops
of
the database we query the table for a name:
mysql> select name from stops where id = 3021;
+----------------------+
| name |
+----------------------+
| California & Augusta |
+----------------------+
1 row in set (0.00 sec)
Our data is on file:
FILENAME = 'CTA/stops.txt'
The script dbctafillstops.py
has the function fillstops()
defined below.
import pymysql
def fillstops(printonly=True):
"""
Opens the file with name 'FILENAME', reads every line
and inserts the data into the table 'stops'.
"""
if printonly:
crs = None
else:
cta = pymysql.connect(db='CTA')
crs = cta.cursor()
print('opening', FILENAME, '...')
datafile = open(FILENAME, 'r')
line = datafile.readline() # skip the first line
while True:
line = datafile.readline()
if(line == ''):
break
insert_data(crs, line, printonly)
if not printonly:
cta.commit()
crs.close()
datafile.close()
For the changes to take effect, we must do a commit()
.
With rollback()
, we can cancel the current transaction,
provided the database and tables support transactions.
In extracting data, consider the description of a stop name:
"Jackson & Austin Terminal, Northeastbound, Bus Terminal"
Observe the commas in the string! Splitting the line read from file first on commas may be complicated to process. Therefore, we first split on the double quotes to extract the name and the description of the stop first.
def extract(line):
"""
Returns a list of 9 elements extracted from the
string line. Missing data are replaced by '0'.
"""
result = []
strd = line.split('\"') # extract strings first
(name, desc) = (strd[1], strd[3])
data = strd[0].split(',')
result.append('0' if data[0] == '' else data[0])
result.append('0' if data[1] == '' else data[1])
result.append(name)
result.append(desc)
data = strd[4].split('\n') # remove newline
data = data[0].split(',')
for k in range(1, len(data)):
result.append('0' if data[k] == '' else data[k])
while len(result) < 9:
result.append('0')
return result
The list on return will always have nine items. The insertion of data is performed by the function below.
def insert_data(cur, line, printonly=True):
"""
Inserts the data in the string line,
using the cursor c, if printonly is False.
"""
data = extract(line)
dbc = 'INSERT INTO stops VALUES ('
dbc += data[0] + ',' + data[1] + ','
dbc += '\"' + data[2] + '\",' # name is a string
dbc += '\"' + data[3] + '\",' # description
for k in range(4, 8):
dbc += data[k] + ','
dbc += data[8] + ')'
print(repr(dbc)) # print raw string
if not printonly:
cur.execute(dbc)
Som queries in the table are shown below.
mysql> select id from stops
-> where name = "California & Augusta";
+-------+
| id |
+-------+
| 3021 |
| 17154 |
+-------+
2 rows in set (0.00 sec)
mysql> select name from stops where id = 17154;
+----------------------+
| name |
+----------------------+
| California & Augusta |
+----------------------+
1 row in set (0.01 sec)
After the filling of the table stops
,
we can query for the stop name, given the stop id,
as illustrated below.
$ python3 dbctastopquery.py
give a stop id : 3021
3021 has name California & Augusta
$
$ python3 dbctastopquery.py
give a stop id : 0
0 has name None
$
The function main in dbctastopquery.py
i
is listed below.
import pymysql
def main():
"""
Connects to the database,
prompts the user for a stop id
and the queries the stops table.
"""
cta = pymysql.connect(db='CTA')
crs = cta.cursor()
stop = int(input('give a stop id : '))
name = get_stop_name(crs, stop)
print(stop, 'has name', name)
cta.close()
The execution of the query is defined in
the function get_stop_name
, shown below.
def get_stop_name(crs, stopid):
"""
Given a cursor crs to the CTA database,
queries the stops table for the stop id.
Returns None if the stop id has not been
found, otherwise returns the stop name.
"""
sel = 'SELECT name FROM stops'
whe = ' WHERE id = %d' % stopid
query = sel + whe
returned = crs.execute(query)
if returned == 0:
return None
else:
tpl = crs.fetchone()
return tpl[0]
The connections between the stops are defined in the
file stop_times.txt
. The first line in the file
list the following fields, with corresponding mysql
type.
trip_id
: typeBIGINT
arrival_time
: typeTIME
departure_time
: typeTIME
stop_id
: typeINT
stop_sequence
: typeINT
stop_headsign
: typeVARCHAR(80)
pickup_type
: typeINT
shape_dist_traveled
: typeINT
A BIGINT
is a 64-bit integer,
while an INT
is a 32-bit integer.
To our database CTA
we add a new table
with the command below.
mysql> create table stop_times
-> (id BIGINT, arrival TIME, departure TIME,
-> stop INT, seq INT, head VARCHAR(80),
-> ptp INT, sdt INT);
Query OK, 0 rows affected (0.02 sec)
Note the types BIGINT
and TIME
.
As a check we execute the explain
.
mysql> explain stop_times;
+-----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+-------+
| id | bigint(20) | YES | | NULL | |
| arrival | time | YES | | NULL | |
| departure | time | YES | | NULL | |
| stop | int(11) | YES | | NULL | |
| seq | int(11) | YES | | NULL | |
| head | varchar(80) | YES | | NULL | |
| ptp | int(11) | YES | | NULL | |
| sdt | int(11) | YES | | NULL | |
+-----------+-------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
An example of a query to manually insert a record is given next.
mysql> insert into stop_times values (
-> 46035893,"12:09:14","12:09:14",6531,29,
-> "Midway Orange Line",0,18625);
Query OK, 1 row affected (0.00 sec)
mysql> select departure, head from stop_times;
+-----------+--------------------+
| departure | head |
+-----------+--------------------+
| 12:09:14 | Midway Orange Line |
+-----------+--------------------+
1 row in set (0.00 sec)
The filling of the table on On Mac OS X laptop, happens as follows.
$ python3 dbctafillstoptimes.py
opening CTA/stop_times.txt ...
dbctafillstoptimes.py:26: Warning: Out of range
value for column 'id' at row 1 c.execute(d)
Redo on a fast Linux Workstation:
# time python3 dbctafillstoptimes.py
opening CTA/stop_times.txt ...
dbctafillstoptimes.py:26: Warning: Out of range
value for column 'id' at row 1 c.execute(d)
real 5m32.433s
user 1m11.921s
sys 0m17.735s
While running dbctafillstoptimes.py
,
the memory consumption of Python and mysql was
of the same magnitude, about 300Mb.
The result is about half a million records,
as verified below.
mysql> select count(*) from stop_times;
+----------+
| count(*) |
+----------+
| 5455515 |
+----------+
1 row in set (1.46 sec)
The insertion of the data is defined by the function below.
def insert_data(crs, scsv, printonly=True):
"""
Inserts the data in the comma separated
string scsv using the cursor crs.
"""
data = scsv.split(',')
cmd = 'INSERT INTO stop_times VALUES ('
cmd += ('0,' if data[0] == '' else data[0] + ',')
cmd += '\"' + data[1] + '\"' + ','
cmd += '\"' + data[2] + '\"' + ','
cmd += data[3] + ',' + data[4] + ','
cmd += data[5] + ',' + data[6] + ','
wrk = data[7] # must cut off the '\n'
data7 = wrk[0:len(wrk)-1] + ')'
cmd += ('0)' if data[7] == '' else data7)
print(repr(cmd))
if not printonly:
crs.execute(cmd)
Illustration of querying stop_times
are below.
mysql> select head from stop_times
-> where stop = 3021 and
-> arrival < "05:30:00";
+----------------+
| head |
+----------------+
| 63rd Pl/Kedzie |
| 63rd Pl/Kedzie |
| 63rd Pl/Kedzie |
| 63rd Pl/Kedzie |
+----------------+
4 rows in set (0.94 sec)
A more involved query asks for all CTA vehicals which halt at a stop before 5:30AM.
mysql> select name, departure, head
-> from stops, stop_times
-> where stops.id = 3021
-> and stops.id = stop_times.stop
-> and stop_times.departure < "05:30:00";
+----------------------+-----------+----------------+
| name | departure | head |
+----------------------+-----------+----------------+
| California & Augusta | 04:43:49 | 63rd Pl/Kedzie |
| California & Augusta | 05:03:49 | 63rd Pl/Kedzie |
| California & Augusta | 05:19:49 | 63rd Pl/Kedzie |
| California & Augusta | 05:12:49 | 63rd Pl/Kedzie |
+----------------------+-----------+----------------+
4 rows in set (0.57 sec)
Exercises¶
- Modify
ctastopname.py
so the user is prompted for a string instead of a number. The modified script prints all id’s and corresponding names that have the given string as substring. Use thein
operator. - The file
stops.txt
contains the latitude and longitude of each stop. Use these coordinates to plot (either with pylab, pyplot, or a Tkinter canvas) the blue line from O’Hare to Forest Park. Use a proper scaling so your plot resembles what we see on a map. - Write a Python script to return the name of stop,
given its id, using the table
stops
. - Design a GUI with Tkinter to query the stop name: one entry field for the stop id, another for the name of the stop, and one button in the middle to execute the query. Note that the GUI allows to query given the stop id or given the stop name.