As I mentioned last time, 99.9% of the time we read the entire contents of a file into data structures for processing. When we do not want to read the entire dataset into memory (because it is too large, or because we only want to look at or edit a single piece of data), we do not use files. Instead we use a database.
Data in a database is organized into collections of records. Each record consists of some attributes, and each attribute has a name and a value. For example, each account in a bank can be one record where the attributes are account number, balance, name on account, account type, etc. Or in our library example from last time, each record could be a book.
The database allows us to manipulate each record individually. The main operations on a database are search for records, insert a new record, update a record, and delete a record.
If we tell the database the types of searches we will do ahead of time (for example, we will most likely be searching bank account records by account number), the database can prepare the data in such a way that these searches are performed quickly.
The above three points are the core of databases: we can access individual records and with some planning make our searches efficient.
Example from Wikipedia: the page table and the revision table. See also the full list of all tables.
While the above three points are the core features, most databases come with some additional helpful features.
The database can enforce that each record follows a specific format, e.g. that every bank account record has an account number. In a perfect world this wouldn't be needed because no software would have bugs, but in the real world the program we write might have a bug (e.g. it might try to add an account without an account number). The database will disallow this, so we can easily detect the bug instead of silently adding an invalid record into the database. The enforced format is called the data model for the database.
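A minimal sketch of this kind of enforcement, using Python's built-in sqlite3 module and a hypothetical accounts table (the table and column names are made up for illustration): a NOT NULL constraint tells the database that every record must have an account number, and a record without one is rejected.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real
# program would pass a filename to sqlite3.connect instead.
con = sqlite3.connect(":memory:")

# NOT NULL is the enforced part of the data model: every record
# must have an account number.
con.execute("CREATE TABLE accounts (number INTEGER NOT NULL, name TEXT, balance REAL)")

# A valid record is accepted.
con.execute("INSERT INTO accounts VALUES (?, ?, ?)", (1, "Alice", 100.0))

try:
    # The bug: trying to add an account without an account number.
    con.execute("INSERT INTO accounts (name, balance) VALUES (?, ?)", ("Bob", 50.0))
except sqlite3.IntegrityError as e:
    # The database refuses the invalid record, so the bug surfaces here.
    print("rejected:", e)
```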
The database guarantees the data does not become corrupted, even in the presence of power failures, the buffering done by the kernel, and other failure scenarios. The basic example here is transferring money between bank accounts: if we naively update one account record to remove the money from its balance, but power fails before we update the second account, we have lost money. The database is more clever: no matter when the power fails, either both accounts were updated or neither was. These guarantees are known as ACID (atomicity, consistency, isolation, durability).
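A sketch of the transfer, again using Python's sqlite3 module and a hypothetical accounts table: both updates are grouped into a single transaction, so the database guarantees that either both are applied or neither is.

```python
import sqlite3

# In-memory database for illustration; the same code works on a file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (number INTEGER, balance REAL)")
con.execute("INSERT INTO accounts VALUES (1, 100.0)")
con.execute("INSERT INTO accounts VALUES (2, 0.0)")
con.commit()

try:
    # Both updates are part of one transaction: the database applies
    # either both or neither, even if power fails partway through.
    con.execute("UPDATE accounts SET balance = balance - 25 WHERE number = 1")
    con.execute("UPDATE accounts SET balance = balance + 25 WHERE number = 2")
    con.commit()
except sqlite3.Error:
    # If anything went wrong, undo the partial transfer.
    con.rollback()
```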
The sqlite3 module in python implements all of the above features. The actual data storage is inside a single file. We call functions in the sqlite3 module to search or update records, and the module reads and writes the file (taking advantage of the ability to read and write only part of the file at a time). It is also careful that the file does not become corrupt when power fails or the program crashes.
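A sketch of the four main operations from above (search, insert, update, delete) using the sqlite3 module and a hypothetical books table from the library example (the table and column names are made up for illustration):

```python
import sqlite3

# The data would normally live in a single file, e.g.
# sqlite3.connect("library.db"); an in-memory database is
# used here so the example is self-contained.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")

# Insert a new record.
con.execute("INSERT INTO books VALUES (?, ?, ?)", ("Dune", "Frank Herbert", 1965))
con.commit()

# Search for records: only matching records are read, not the whole file.
for title, author, year in con.execute(
        "SELECT * FROM books WHERE author = ?", ("Frank Herbert",)):
    print(title, year)

# Update a record.
con.execute("UPDATE books SET year = ? WHERE title = ?", (1966, "Dune"))

# Delete a record.
con.execute("DELETE FROM books WHERE title = ?", ("Dune",))
con.commit()
```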
We call these local databases because all the processing happens within our program (even if it happens in the sqlite3 module, it is still within our single program). There are other local databases, but sqlite3 is by far the most popular. You can see the list of those that python supports here.
We use these local databases when we want record storage for a single program, for example Firefox uses sqlite to store the history of all websites you have visited (one record for each site). It can then use the efficient searches to determine if a link should be colored blue or purple (a purple link for a visited site).
The second kind of database is a database that runs as a separate program from our program. When we want to search/add/update/remove a record, we send the request to the database over the network. The database then sends the response back to us over the network. Each database usually has its own module which handles this network communication, and these do not come as part of the standard python library (they must be installed separately). Example databases of this type are MySQL, MariaDB, CouchDB, PostgreSQL, Oracle, SQL Server, and many others.
These network databases then also support several additional features:
Concurrent access: multiple programs can all communicate with the database over the network at once, and the database program will respond to all the requests using the same collection of records. For example, the database at a bank storing the accounts can be accessed from the programs running on the bank teller machines plus the programs running on each ATM. The database will provide protection so that multiple programs do not attempt to edit the same record at the same time.
Users, permissions, and security: the database can be configured so that only certain users can access certain records.
Network databases like the above are useful when there are many programs all wanting to work on the same collection of records. The downside is that network databases usually require quite a bit of configuration for users and permissions and they are on the network so you have to worry about security updates. So for many programs they are overkill and either a local database like sqlite3 or just files containing a file format like JSON are more appropriate.
Download this csv, which contains data on all earthquakes of magnitude greater than 2.5 in the past week. I found that link from this page.
Write a python script to load the csv file using the csv module. You will just need to use rows = csv.reader(f) and then for row in rows, like in the simplest example.
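A sketch of the shape such a script takes. Since the downloaded file cannot be included here, the sketch first writes a tiny stand-in file (the name earthquakes.csv and the columns shown are placeholders; the real USGS csv has more columns), then reads it back with csv.reader:

```python
import csv

# A tiny stand-in for the downloaded file, so the sketch is self-contained.
with open("earthquakes.csv", "w", newline="") as f:
    f.write("time,latitude,longitude,mag,place\n")
    f.write("2024-01-01T00:00:00,34.0,-118.0,3.1,somewhere\n")

# Load the csv file, as in the simplest example from the csv module.
with open("earthquakes.csv", newline="") as f:
    rows = csv.reader(f)
    for row in rows:
        # Each row is a list of strings, one per column in the csv.
        print(row)
```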