Oct 7

Lecture Overview

Databases

Databases.

As I mentioned last time, 99.9% of the time we read the entire contents of the file into data structures for processing. When we do not want to read the entire dataset into memory (because it is too large or because we only want to look at and edit a single piece of data), we do not use files. Instead we use a database.

The above three points are the core of databases: we can access individual records and with some planning make our searches efficient.

Example from wikipedia: page table, revision table. Full list of all tables.

Additional Features

While the above three points are the core features, most databases come with some additional helpful features.

Local Databases

The sqlite3 module in python implements all of the above features. The actual data storage is inside a single file. We call functions in the sqlite3 module to search or update records, and the module reads and writes the file (and takes advantage of the ability to read and write only part of the file at a time). It is also careful about the file becoming corrupt when power fails or the program crashes.

We call these local databases because all the processing happens within our program (even if it happens in the sqlite3 module, it is still within our single program). There are other local databases, but sqlite3 is by far the most popular. You can see the list of those that python supports here.

We use these local databases when we want record storage for a single program, for example Firefox uses sqlite to store the history of all websites you have visited (one record for each site). It can then use the efficient searches to determine if a link should be colored blue or purple (a purple link for a visited site).

Network Databases

The second kind of database is a database that runs as a separate program from our program. When we want to search/add/update/remove a record, we send the request to the database over the network. The database then sends the response back to us over the network. Each database usually has its own module which handles this network communication and these do not come as part of the standard python library (they must be installed separately). Example databases of this type are MySQL, MariaDB, CouchDB, PostgresSQL, Oracle, SQLServer, and many others.

These network databases then also support several additional features:

Network databases like the above are useful when there are many programs all wanting to work on the same collection of records. The downside is that network databases usually require quite a bit of configuration for users and permissions and they are on the network so you have to worry about security updates. So for many programs they are overkill and either a local database like sqlite3 or just files containing a file format like JSON are more appropriate.

Exercises