Processing Data from the Web

As an application we consider the creation of mysql tables with schedule information of the Chicago Transit Authority (CTA).

CTA Tables

We can download the schedules of the CTA, via http://www.transitchicago.com/developers/gtfs.aspx. The GTFS (General Transit Feed Specification) is an open format for packaging scheduled service data. A GTFS feed is a series of text files with data on lines separated by commas (csv format). Each file is a table in a relational database.

Some tables in the feed are the following:

  • stops.txt: stop locations for bus or train
  • routes.txt: route list with unique identifiers
  • trips.txt: information about each trip by a vehicle
  • stop_times.txt: scheduled arrival and departure times for each stop on each trip.

Before we start with mysql, we write scripts to search the files. The first script will prompt the user for a number and then look whether the number corresponds to a stop id in the file stops.txt. If there is a correspondence, then the name corresponding with the stop id will be printed. A run with the script is below.

$ python3 ctastopname.py
opening CTA/stops.txt ...
give a stop id : 3021
skipping line 0
3021 has name "California & Augusta"

The script looks for the line

3021,3021,"California & Augusta",41.89939053,-87.69688045,0,,1

In a Terminal window, we can type

$ cat stops.txt | grep ",3021,"
3021,3021,"California & Augusta",41.89939053,-87.69688045,0,,1
$

The file stops.txt is saved in the subdirectory CTA of the current directory where the script ctastopname.py is. The code for ctastopname.py is listed next.

FILENAME = 'CTA/stops.txt'
print('opening', FILENAME, '...')
DATAFILE = open(FILENAME, 'r')
STOPID = int(input('give a stop id : '))
COUNT = 0
STOPNAME = None
while True:
    LINE = DATAFILE.readline()
    if LINE == '':
        break
    L = LINE.split(',')
    try:
        if int(L[0]) == STOPID:
            STOPNAME = L[2]
            break
    except:
        print('skipping line', COUNT)
    COUNT = COUNT + 1
print(STOPID, 'has name', STOPNAME)

The other data file we consider is stop_times.txt. For a given stop, we search through that file for the trips the pass at this stop. The script ctastoptimes.py prompts the user for an identification of a stop and then prints all CTA vehicles that stop there.

$ python3 ctastoptimes.py
opening CTA/stop_times.txt ...
give a stop id : 3021
skipping line 0
adding "63rd Pl/Kedzie"
adding "Jackson"
['"63rd Pl/Kedzie"', '"Jackson"']

We scan the lines in stop_times.txt for where the given stop identification occurs. Code for ctastoptimes.py is listed below.

FILENAME = 'CTA/stop_times.txt'
print('opening', FILENAME, '...')
DATAFILE = open(FILENAME, 'r')
STOPID = int(input('give a stop id : '))
COUNT = 0
TIMES = []
while True:
    LINE = DATAFILE.readline()
    if LINE == '':
        break
    L = LINE.split(',')
    try:
        if int(L[3]) == STOPID:
            if L[5] not in TIMES:
                print('adding', L[5])
                TIMES.append(L[5])
    except:
        print('skipping line', COUNT)
    COUNT = COUNT + 1
print(TIMES)

CTA Tables in MySQL

Searching through the files works for an one time lookup, but is inefficient for frequent consultation. A GTFS feed is a series of text files with data on lines separated by commas (csv format). Each file is a table in a relational database. We call our database CTA and will add tables reading the information from stops.txt.

The first line in stops.txt lists nine fields. For each field, we select an appropriate mysql type, as enumerated below:

  1. stop_id: type INT
  2. stop_code: type INT
  3. stop_name: type CHAR(80)
  4. stop_desc: type VARCHAR(128)
  5. stop_lat: type FLOAT
  6. stop_lon: type FLOAT
  7. location_type: type INT
  8. parent_station: type INT
  9. wheelchair_boarding: type SMALLINT

Depending on the setup of mysql, we may have to execute mysql as superuser (or sudo). To make a database CTA, we run mysqladmin at the command prompt.

$ mysqladmin create CTA

Then we start mysql, to create a table in the database:

mysql> use CTA;
Database changed
mysql> create table stops
    -> (id INT, code INT, name CHAR(80),
    -> ndesc VARCHAR(128),
    -> lat FLOAT, lon FLOAT,
    -> tp INT, ps INT, wb SMALLINT);
Query OK, 0 rows affected (0.01 sec)

To double check if the creation of the table stops was a success, we check as follows.

mysql> explain stops;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(11)      | YES  |     | NULL    |       |
| code  | int(11)      | YES  |     | NULL    |       |
| name  | char(80)     | YES  |     | NULL    |       |
| ndesc | varchar(128) | YES  |     | NULL    |       |
| lat   | float        | YES  |     | NULL    |       |
| lon   | float        | YES  |     | NULL    |       |
| tp    | int(11)      | YES  |     | NULL    |       |
| ps    | int(11)      | YES  |     | NULL    |       |
| wb    | smallint(6)  | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
9 rows in set (0.00 sec)

The first data line in stops.txt contains

1,1,"Jackson & Austin Terminal","Jackson & Austin Terminal, Northeastbound, Bus Terminal",41.87632184,-87.77410482,0,,1

We can manually insert this data line as

mysql> insert into stops values
    -> (1,1,"Jackson & Austin Terminal",
    -> "Jackson & Austin Terminal, Northeastbound, Bus Terminal",
    -> 41.87632184,-87.77410482,0,0,1);
Query OK, 1 row affected (0.00 sec)

To check the above insert, we do a simple query.

mysql> select name from stops where id = 1;
+---------------------------+
| name                      |
+---------------------------+
| Jackson & Austin Terminal |
+---------------------------+
1 row in set (0.00 sec)

To delete a row, given its id we execute the query:

mysql> delete from stops where id = 1;
Query OK, 1 row affected (0.65 sec)

mysql> select * from stops;
Empty set (0.01 sec)

If the where clause is omitted, then all rows in the table are deleted.

Typing in 12,162 entries is rather tedious. Therefore, we will fill the table stops with a Python script. After filling the table stops of the database we query the table for a name:

mysql> select name from stops where id = 3021;
+----------------------+
| name                 |
+----------------------+
| California & Augusta |
+----------------------+
1 row in set (0.00 sec)

Our data is on file:

FILENAME = 'CTA/stops.txt'

The script dbctafillstops.py has the function fillstops() defined below.

import pymysql

def fillstops(printonly=True):
    """
    Opens the file with name 'FILENAME', reads every line
    and inserts the data into the table 'stops'.
    """
    if printonly:
        crs = None
    else:
        cta = pymysql.connect(db='CTA')
        crs = cta.cursor()
    print('opening', FILENAME, '...')
    datafile = open(FILENAME, 'r')
    line = datafile.readline()  # skip the first line
    while True:
        line = datafile.readline()
        if(line == ''):
            break
        insert_data(crs, line, printonly)
    if not printonly:
        cta.commit()
        crs.close()
    datafile.close()

For the changes to take effect, we must do a commit(). With rollback(), we can cancel the current transaction, provided the database and tables support transactions.

In extracting data, consider the description of a stop name:

"Jackson & Austin Terminal, Northeastbound, Bus Terminal"

Observe the commas in the string! Splitting the line read from file first on commas may be complicated to process. Therefore, we first split on the double quotes to extract the name and the description of the stop first.

def extract(line):
    """
    Returns a list of 9 elements extracted from the
    string line.  Missing data are replaced by '0'.
    """
    result = []
    strd = line.split('\"') # extract strings first
    (name, desc) = (strd[1], strd[3])
    data = strd[0].split(',')
    result.append('0' if data[0] == '' else data[0])
    result.append('0' if data[1] == '' else data[1])
    result.append(name)
    result.append(desc)
    data = strd[4].split('\n') # remove newline
    data = data[0].split(',')
    for k in range(1, len(data)):
        result.append('0' if data[k] == '' else data[k])
    while len(result) < 9:
        result.append('0')
    return result

The list on return will always have nine items. The insertion of data is performed by the function below.

def insert_data(cur, line, printonly=True):
    """
    Inserts the data in the string line,
    using the cursor c, if printonly is False.
    """
    data = extract(line)
    dbc = 'INSERT INTO stops VALUES ('
    dbc += data[0] + ',' + data[1] + ','
    dbc += '\"' + data[2] + '\",' # name is a string
    dbc += '\"' + data[3] + '\",' # description
    for k in range(4, 8):
        dbc += data[k] + ','
    dbc += data[8] + ')'
    print(repr(dbc))       # print raw string
    if not printonly:
        cur.execute(dbc)

Som queries in the table are shown below.

mysql> select id from stops
    -> where name = "California & Augusta";
+-------+
| id    |
+-------+
|  3021 |
| 17154 |
+-------+
2 rows in set (0.00 sec)

mysql> select name from stops where id = 17154;
+----------------------+
| name                 |
+----------------------+
| California & Augusta |
+----------------------+
1 row in set (0.01 sec)

After the filling of the table stops, we can query for the stop name, given the stop id, as illustrated below.

$ python3 dbctastopquery.py
give a stop id : 3021
3021 has name California & Augusta
$

$ python3 dbctastopquery.py
give a stop id : 0
0 has name None
$

The function main in dbctastopquery.py i is listed below.

import pymysql

def main():
    """
    Connects to the database,
    prompts the user for a stop id
    and the queries the stops table.
    """
    cta = pymysql.connect(db='CTA')
    crs = cta.cursor()
    stop = int(input('give a stop id : '))
    name = get_stop_name(crs, stop)
    print(stop, 'has name', name)
    cta.close()

The execution of the query is defined in the function get_stop_name, shown below.

def get_stop_name(crs, stopid):
    """
    Given a cursor crs to the CTA database,
    queries the stops table for the stop id.
    Returns None if the stop id has not been
    found, otherwise returns the stop name.
    """
    sel = 'SELECT name FROM stops'
    whe = ' WHERE id = %d' % stopid
    query = sel + whe
    returned = crs.execute(query)
    if returned == 0:
        return None
    else:
        tpl = crs.fetchone()
        return tpl[0]

The connections between the stops are defined in the file stop_times.txt. The first line in the file list the following fields, with corresponding mysql type.

  1. trip_id: type BIGINT
  2. arrival_time: type TIME
  3. departure_time: type TIME
  4. stop_id: type INT
  5. stop_sequence: type INT
  6. stop_headsign: type VARCHAR(80)
  7. pickup_type: type INT
  8. shape_dist_traveled: type INT

A BIGINT is a 64-bit integer, while an INT is a 32-bit integer.

To our database CTA we add a new table with the command below.

mysql> create table stop_times
    -> (id BIGINT, arrival TIME, departure TIME,
    -> stop INT, seq INT, head VARCHAR(80),
    -> ptp INT, sdt INT);
Query OK, 0 rows affected (0.02 sec)

Note the types BIGINT and TIME. As a check we execute the explain.

mysql> explain stop_times;
+-----------+-------------+------+-----+---------+-------+
| Field     | Type        | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+-------+
| id        | bigint(20)  | YES  |     | NULL    |       |
| arrival   | time        | YES  |     | NULL    |       |
| departure | time        | YES  |     | NULL    |       |
| stop      | int(11)     | YES  |     | NULL    |       |
| seq       | int(11)     | YES  |     | NULL    |       |
| head      | varchar(80) | YES  |     | NULL    |       |
| ptp       | int(11)     | YES  |     | NULL    |       |
| sdt       | int(11)     | YES  |     | NULL    |       |
+-----------+-------------+------+-----+---------+-------+
8 rows in set (0.00 sec)

An example of a query to manually insert a record is given next.

mysql> insert into stop_times values (
    -> 46035893,"12:09:14","12:09:14",6531,29,
    -> "Midway Orange Line",0,18625);
Query OK, 1 row affected (0.00 sec)

mysql> select departure, head from stop_times;
+-----------+--------------------+
| departure | head               |
+-----------+--------------------+
| 12:09:14  | Midway Orange Line |
+-----------+--------------------+
1 row in set (0.00 sec)

The filling of the table on On Mac OS X laptop, happens as follows.

$ python3 dbctafillstoptimes.py
opening CTA/stop_times.txt ...
dbctafillstoptimes.py:26: Warning: Out of range
  value for column 'id' at row 1 c.execute(d)

Redo on a fast Linux Workstation:

# time python3 dbctafillstoptimes.py
opening CTA/stop_times.txt ...
dbctafillstoptimes.py:26: Warning: Out of range
  value for column 'id' at row 1 c.execute(d)

real    5m32.433s
user    1m11.921s
sys     0m17.735s

While running dbctafillstoptimes.py, the memory consumption of Python and mysql was of the same magnitude, about 300Mb. The result is about half a million records, as verified below.

mysql> select count(*) from stop_times;
+----------+
| count(*) |
+----------+
|  5455515 |
+----------+
1 row in set (1.46 sec)

The insertion of the data is defined by the function below.

def insert_data(crs, scsv, printonly=True):
    """
    Inserts the data in the comma separated
    string scsv using the cursor crs.
    """
    data = scsv.split(',')
    cmd = 'INSERT INTO stop_times VALUES ('
    cmd += ('0,' if data[0] == '' else data[0] + ',')
    cmd += '\"' + data[1] + '\"' + ','
    cmd += '\"' + data[2] + '\"' + ','
    cmd += data[3] + ',' + data[4] + ','
    cmd += data[5] + ',' + data[6] + ','
    wrk = data[7] # must cut off the '\n'
    data7 = wrk[0:len(wrk)-1] + ')'
    cmd += ('0)' if data[7] == '' else data7)
    print(repr(cmd))
    if not printonly:
        crs.execute(cmd)

Illustration of querying stop_times are below.

mysql> select head from stop_times
    -> where stop = 3021 and
    -> arrival < "05:30:00";
+----------------+
| head           |
+----------------+
| 63rd Pl/Kedzie |
| 63rd Pl/Kedzie |
| 63rd Pl/Kedzie |
| 63rd Pl/Kedzie |
+----------------+
4 rows in set (0.94 sec)

A more involved query asks for all CTA vehicals which halt at a stop before 5:30AM.

mysql> select name, departure, head
    -> from stops, stop_times
    -> where stops.id = 3021
    -> and stops.id = stop_times.stop
    -> and stop_times.departure < "05:30:00";
+----------------------+-----------+----------------+
| name                 | departure | head           |
+----------------------+-----------+----------------+
| California & Augusta | 04:43:49  | 63rd Pl/Kedzie |
| California & Augusta | 05:03:49  | 63rd Pl/Kedzie |
| California & Augusta | 05:19:49  | 63rd Pl/Kedzie |
| California & Augusta | 05:12:49  | 63rd Pl/Kedzie |
+----------------------+-----------+----------------+
4 rows in set (0.57 sec)

Exercises

  1. Modify ctastopname.py so the user is prompted for a string instead of a number. The modified script prints all id’s and corresponding names that have the given string as substring. Use the in operator.
  2. The file stops.txt contains the latitude and longitude of each stop. Use these coordinates to plot (either with pylab, pyplot, or a Tkinter canvas) the blue line from O’Hare to Forest Park. Use a proper scaling so your plot resembles what we see on a map.
  3. Write a Python script to return the name of stop, given its id, using the table stops.
  4. Design a GUI with Tkinter to query the stop name: one entry field for the stop id, another for the name of the stop, and one button in the middle to execute the query. Note that the GUI allows to query given the stop id or given the stop name.