L-41 MCS 360 Friday 25 April 2003

Below is a very brief summary of what we talked about in class. If you missed the lecture, this summary may guide your reading of the textbook.

Hashing

A hash function maps a key to a position in the table.

1. Desirable Properties

We identified four properties we desire from a hash function h:
  1. fast to evaluate, in uniform time for every key; (the main goal is fast, uniform access to the data)
  2. perfect: for all keys i /= j, we have h(i) /= h(j); (we want to avoid collisions, also called hash clashes)
  3. minimal: h maps the #K keys onto exactly #K positions, so the size of the table equals the number of keys; (we wish to avoid wasting memory)
  4. order preserving: for i < j, we have h(i) < h(j); (in case we want to traverse the table in the order of the keys)
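
Properties 2, 3 and 4 can be checked mechanically once the set of keys is known. The Python sketch below is only an illustration: the function names, the sample keys and the two hash functions h_good and h_bad are our own assumptions, not taken from the lecture or the textbook.

```python
def is_perfect(h, keys):
    """True if h maps distinct keys to distinct positions (no collisions)."""
    positions = [h(k) for k in set(keys)]
    return len(set(positions)) == len(positions)


def is_minimal(h, keys, table_size):
    """True if h is perfect, stays inside the table, and the table
    has exactly one position per key."""
    distinct = set(keys)
    in_range = all(0 <= h(k) < table_size for k in distinct)
    return in_range and is_perfect(h, keys) and table_size == len(distinct)


def is_order_preserving(h, keys):
    """True if i < j implies h(i) < h(j); checking neighbors in sorted
    order is enough because < is transitive."""
    ordered = sorted(set(keys))
    return all(h(a) < h(b) for a, b in zip(ordered, ordered[1:]))


# Hypothetical key set, chosen only for illustration.
keys = [10, 20, 30, 40, 50]


def h_good(k):
    return k // 10 - 1   # maps the keys to positions 0, 1, 2, 3, 4


def h_bad(k):
    return k % 4         # 10, 30 and 50 all land on position 2
```

For these keys, h_good passes all three checks with a table of size 5, while h_bad already fails the test for perfection.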

2. Techniques to create hash functions

We survey three techniques to build hash functions:
  1. Selecting digits from the keys
  2. Folding, e.g.: adding all digits in the key
  3. Modular arithmetic, e.g.: h(k) = k modulo size(table)
Usually a hash function combines these three techniques. An interesting one uses a random vector of doubles: we choose one random vector once and for all and keep it fixed during hashing. The hash function takes the inner product of this random vector with the digits of the key. With probability one, the inner products of distinct keys are all different. Selecting the leading digits of the inner product and reducing them modulo size(table) yields the hash function. Experimenting with this hash function is a very interesting exercise.
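
As a starting point for such experiments, here is a minimal Python sketch of the three techniques and of the random vector hash. Several details are assumptions on our part: keys are non-negative integers, TABLE_SIZE is 11, the digit positions in select_digits are arbitrary, and "selecting the leading digits" is read as keeping the integer part and the first few decimals of the inner product. Other readings are possible.

```python
import random

TABLE_SIZE = 11  # hypothetical table size


def digits(key):
    """Decimal digits of a non-negative integer key, most significant first."""
    return [int(c) for c in str(key)]


def select_digits(key, positions=(0, 2)):
    """Technique 1: build a number from the digits at the given positions."""
    d = digits(key)
    return int("".join(str(d[p]) for p in positions))


def fold(key):
    """Technique 2: folding, here adding all digits in the key."""
    return sum(digits(key))


def modular(key, size=TABLE_SIZE):
    """Technique 3: h(k) = k modulo size(table)."""
    return key % size


def make_random_vector_hash(num_digits, size=TABLE_SIZE, leading=4, seed=360):
    """Fix one random vector of doubles, then hash by inner product."""
    rng = random.Random(seed)
    vector = [rng.random() for _ in range(num_digits)]  # chosen once, kept fixed

    def h(key):
        d = digits(key)
        if len(d) > num_digits:
            raise ValueError("key has more digits than the random vector")
        d = [0] * (num_digits - len(d)) + d  # left-pad shorter keys with zeros
        product = sum(v * x for v, x in zip(vector, d))
        # Assumed reading of "leading digits": integer part of the inner
        # product followed by its first `leading` decimals.
        lead = int(product * 10 ** leading)
        return lead % size

    return h
```

For the key 52917, select_digits gives 59, fold gives 24 and modular gives 7. A hash function obtained from make_random_vector_hash(5) always returns the same position for the same key, because the vector is fixed by the seed. Distinct inner products may still collide after the reduction modulo size(table), which is one of the things worth measuring in an experiment.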

3. Dealing with collisions

Collisions or hash clashes are usually unavoidable (unless we have studied our set of keys really well). There are two methods for dealing with collisions:
  1. having buckets of records (see radix sort and address calculation); the book shows techniques to deal with overflowing buckets
  2. linear probing or rehashing; one simple example is rh(k,i) = (h(k) + i) modulo size(table), where i counts the rehashes (beware of infinite loops: if the table is full, the search for a free position never ends!)
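
Both methods can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the textbook's code: keys are integers, h(k) = k modulo size(table), an empty position is marked with None, and buckets are Python lists that grow as needed, so overflowing buckets are not modeled. All function names are hypothetical.

```python
def h(key, size):
    """Base hash function: h(k) = k modulo size(table)."""
    return key % size


def bucket_insert(table, key):
    """Method 1: every table position holds a bucket of keys."""
    table[h(key, len(table))].append(key)


def probe_insert(table, key):
    """Method 2: rehash with rh(k, i) = (h(k) + i) modulo size(table).

    Returns the position used. The loop tries at most size(table)
    positions, so a full table raises an error instead of looping forever.
    """
    size = len(table)
    for i in range(size):
        pos = (h(key, size) + i) % size
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("table is full")


def probe_search(table, key):
    """Follow the same rehash sequence; return the position or None."""
    size = len(table)
    for i in range(size):
        pos = (h(key, size) + i) % size
        if table[pos] is None:
            return None      # reached a free position: key is absent
        if table[pos] == key:
            return pos
    return None              # visited every position without success


# The keys 12, 22 and 7 all hash to position 2 in a table of size 5.
buckets = [[] for _ in range(5)]
slots = [None] * 5
for k in (12, 22, 7):
    bucket_insert(buckets, k)
    probe_insert(slots, k)
```

After the loop, the three keys share bucket 2 in the first table, while in the second table they occupy positions 2, 3 and 4. Bounding the number of rehashes by size(table) is what prevents the infinite loop.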