Counting Words and Pattern Matching
===================================

Dictionaries are a very important data structure.
A typical application is to build a frequency table.
We illustrate this application on a downloaded text.

The builtin module ``re`` provides tools to match
strings against regular expressions.

Dictionaries
------------

Project Gutenberg at <http://www.gutenberg.org>
is a good source for many texts in the public domain.
The free books are available as plain ``.txt`` files.
As our example we take the 1898 novel by H.G. Wells,
*The War of the Worlds*, with its cover in :numref:`figwarofworlds`.

.. _figwarofworlds:

.. figure:: ./figwarofworlds.png
    :align: center

    Cover of a book in the public domain.

In our course, we are not reading the book for its literary qualities,
but we use the text file to count the frequencies of its words.

In our analysis, we assume we have downloaded the entire text
as a ``.txt`` file.  If we wanted to download the book line by line,
then we could use the following script:

::

   URL = 'http://www.gutenberg.org/files/36/36.txt'

   from urllib.request import urlopen

   INFILE = urlopen(URL)
   while True:
       LINE = INFILE.readline()
       if LINE == b'':
           print('End of file reached.')
           break
       print(LINE)
       ans = input('Get another line ? ')
       if ans != 'y':
           break

In processing a text file, 
we want to answer the following questions:

1. How many different words are there in the text?

2. What is the frequency of each word?

3. Given a frequency, how many words occur with this frequency?

4. What words occur 100 times or more?

To us, anything separated by one or more spaces is a word.

The analysis of the text of *The War of the Worlds* yields a total
word count of 66,491.  There are 13,003 different words.
The words used 100 times or more appear in the output of the script:

::

   (4076, ['the'])
   ...
   (119, ['there'])
   (116, ['people', 'And'])
   (114, ['an'])
   (113, ['Martians'])
   (111, ['saw', 'through'])
   ...

The algorithm to count the number of words is pictured
in :numref:`figcountwords`.  Observe the repeat-until construction.

.. _figcountwords:

.. figure:: ./figcountwords.png
    :align: center

    Counting words in a text on file.

The code to scan a file starts below:

::

   BOOK = "war_of_the_worlds.txt"

   def word_count(name):
       """
       Opens the file with name and counts the
       number of words.  Anything that is separated
       by spaces is considered a word.
       """
       file = open(name, 'r')
       count = 0
       while True:
           line = file.readline()
           if line == '':
               break
           words = line.split(' ')
           count = count + len(words)
       file.close()
       return count
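
Note that ``split(' ')`` and ``split()`` behave differently.
The short sketch below, on a made-up line (the string ``line``
is only an illustration, not taken from the book), shows that
splitting on a single space keeps empty strings and the trailing
newline, whereas ``split()`` without arguments splits on any run
of whitespace.  Because ``word_count`` uses ``split(' ')``, its
result presumably includes such empty strings.

::

   line = 'the  war of\n'        # two spaces after 'the'
   by_space = line.split(' ')    # ['the', '', 'war', 'of\n']
   by_blank = line.split()       # ['the', 'war', 'of']
   print(len(by_space), len(by_blank))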

For the frequency count, the proper data structure is a dictionary.
A dictionary in Python is

* a set of ``key:value`` pairs:
  any type goes for ``value``,
  but ``key`` must be of a hashable type,
  such as ``str``, ``int``, or a tuple of those;

* a hash table, where hashing the key allows for fast access.

For the frequency table in our application, we have

* type of key: ``str``,

* type of value: ``int``.

An example of a ``key:value`` pair is ``'the': 4076``.
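
Keys must be hashable, which in practice rules out mutable
types such as lists.  The sketch below illustrates this on
a small dictionary; the tuple key is made up for illustration
and does not occur in our application.

::

   freq = {}
   freq['the'] = 4076                # a str is hashable
   freq[('war', 'worlds')] = 1       # a tuple of strings is hashable too
   try:
       freq[['war', 'worlds']] = 1   # a list is mutable, hence unhashable
   except TypeError as error:
       print('cannot use a list as key :', error)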

To introduce the use of dictionaries,
consider the following interactive Python session:

::

   >>> D = {}
   >>> D['s'] = 1
   >>> 's' in D
   True
   >>> 't' in D
   False
   >>> D['t'] = 2
   >>> D.items()
   dict_items([('s', 1), ('t', 2)])
   >>> D.values()
   dict_values([1, 2])
   >>> D.keys()
   dict_keys(['s', 't'])

Useful constructions on dictionaries are listed
in :numref:`tabconsdict`.

.. _tabconsdict:

.. table:: working with dictionaries

   +-----------------------+---------------------------------------+
   | Python construction   | what it means                         |
   +=======================+=======================================+
   | ``D = { }``           | initialization                        |
   +-----------------------+---------------------------------------+
   | ``D[<key>] = <value>``| add a ``key:value`` pair              |
   +-----------------------+---------------------------------------+
   | ``D[<key>]``          | selection of value, given ``key``     |
   +-----------------------+---------------------------------------+
   | ``<key> in D``        | ``True`` if ``D[<key>]`` exists       |
   +-----------------------+---------------------------------------+
   | ``D.items()``         | ``dict_items`` of tuples (key, value) |
   +-----------------------+---------------------------------------+
   | ``D.keys()``          | returns ``dict_keys`` of all keys     |
   +-----------------------+---------------------------------------+
   | ``D.values()``        | returns ``dict_values`` of all values |
   +-----------------------+---------------------------------------+

The computation of the dictionary of word frequencies
is done by the function below.

::

   def word_frequencies(name):
       """
       Returns a dictionary with the frequencies
       of the words occurring on file with name.
       """
       file = open(name, 'r')
       result = {}
       while True:
           line = file.readline()
           if line == '':
               break
           words = line.split(' ')
           for word in words:
               if word in result:
                   result[word] += 1
               else:
                   result[word] = 1
       file.close()
       return result
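
The same table can be built more compactly with
``collections.Counter`` from the standard library.
The sketch below is an alternative, not the version used
for the results in this section; the function name
``word_frequencies_counter`` is made up for illustration.

::

   from collections import Counter

   def word_frequencies_counter(name):
       """
       Returns a Counter with the frequencies
       of the words occurring on file with name,
       splitting each line on single spaces.
       """
       result = Counter()
       with open(name, 'r') as file:
           for line in file:
               result.update(line.split(' '))
       return result

A ``Counter`` is a subclass of ``dict``, so ``freq['the']`` works
as before, and its method ``most_common(10)`` lists the ten most
frequent words with their counts.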

With the command ``freq = word_frequencies(BOOK)``
we see how many times each word occurs,
for example: ``freq['the']`` returns 4076.
But we want to know the most frequent words,
that is: we want to query ``freq`` on the values.
We invert the dictionary:

1. the keys are the frequency counts,

2. because several words may occur with the same frequency,
   the values are lists.

For example: ``invfreq[295]`` will be ``['for', 'from']``,
if ``invfreq`` is the inverted ``freq``.
The computation of this dictionary, where the keys are
the frequencies, is defined by the function ``frequencies_of_words``.

::

   def frequencies_of_words(freq):
       """
       Reverts the keys and values of the
       given dictionary freq.  Because several
       words may occur with the same frequency,
       the values are lists of words.
       """
       result = {}
       for key in freq:
           if freq[key] in result:
               result[freq[key]].append(key)
           else:
               result[freq[key]] = [key]
       return result
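
To see this at work on a small example, the sketch below
processes a made-up frequency table (the counts are invented,
not taken from the book).  It uses ``setdefault`` as a compact
variant of the ``if``/``else`` test in the function above.

::

   freq = {'for': 3, 'from': 3, 'the': 7}   # invented counts
   invfreq = {}
   for word, count in freq.items():
       # start an empty list the first time a count is seen
       invfreq.setdefault(count, []).append(word)
   print(invfreq)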

Recall that one of our original questions was to find all
words used 100 times or more.
To answer this question, we need to sort the dictionary items.
We cannot sort a dictionary in place, but we can sort a list.
Therefore we take the items in the dictionary and store
those items in a list.
The ``items()`` method on any dictionary returns
a view of ``(key, value)`` tuples, which ``list()`` turns into a list:

::

   >>> L = list(D.items())
   >>> L
   [('s', 1), ('t', 2)]

To sort on the second element of each tuple, from high to low:

::

   >>> L.sort(key=lambda i: i[1],reverse=True)
   >>> L
   [('t', 2), ('s', 1)]
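
The ``sort()`` method changes the list in place.
If the original order must be kept, the builtin ``sorted()``
returns a new list instead, as in the following sketch
on the same small dictionary:

::

   D = {'s': 1, 't': 2}
   L = list(D.items())
   S = sorted(L, key=lambda item: item[1], reverse=True)
   print(L)   # the original list is unchanged
   print(S)   # the new list is sorted from high to low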

The main program is listed below.

::

   def main():
       """
       Analysis of words in a book.
       """
       cnt = word_count(BOOK)
       print('words counted :', cnt)
       freq = word_frequencies(BOOK)
       print('number of different words :', len(freq))
       invfreq = frequencies_of_words(freq)
       lstfreq = list(invfreq.items())
       lstfreq.sort(key=lambda e: e[0], reverse=True)
       print("words used 100 times or more :")
       for item in lstfreq:
           if item[0] < 100:
               break
           print(item)
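
The third question, how many words occur with a given frequency,
is not answered by ``main()``, but it follows directly from the
dictionary with the frequencies as keys.  A minimal sketch,
on invented data and with the hypothetical helper
``count_with_frequency``:

::

   def count_with_frequency(invfreq, frequency):
       """
       Returns the number of words that occur
       exactly frequency many times.
       """
       return len(invfreq.get(frequency, []))

   invfreq = {3: ['for', 'from'], 7: ['the']}   # invented data
   print(count_with_frequency(invfreq, 3))      # two words occur 3 times
   print(count_with_frequency(invfreq, 5))      # no word occurs 5 times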


Pattern Matching
----------------

Manipulating text and strings is an important task.
When parsing data, we must ensure the entered data is correct.
When searching through confidential data, it is best to use a program
to maintain confidentiality.

For example, suppose ``answer`` contains the answer
to a yes or no question.
Acceptable ``yes`` answers are
``y`` or ``yes``, ``Y`` or ``Yes``.
Testing all four cases separately is tedious.

Python offers support for regular expressions,
with the module ``re``, which is a standard library module.

Consider the following interactive session,
to match short and long answers.

::

   >>> import re
   >>> (short, long) = ('y', 'Yes')
   >>> re.match('y',short) != None
   True
   >>> re.match('y',long) != None
   False
   >>> re.match('y|Y',long) != None
   True
   >>> re.match('y|Y', long) 
   <re.Match object; span=(0, 1), match='Y'>
   >>> re.match('y|Y',long).group()
   'Y'
   >>> re.match('y|Y',short).group()
   'y'

The function ``match()`` in the ``re`` module
has the following specification:

::

   re.match( < pattern > , < string > )

There are two possible outcomes:

* If the ``string`` does not match the ``pattern``,
  then ``None`` is returned.

* If the ``string`` matches the ``pattern``,
  then a match object is returned.

For example, consider

::

   >>> re.match('he','hello')
   <re.Match object; span=(0, 2), match='he'>
   >>> re.match('hi','hello') == None
   True

What can we do with the match object?
We can apply the ``group()`` method; consider

::

   >>> re.match('he','hello')
   <re.Match object; span=(0, 2), match='he'>
   >>> _.group()
   'he'

After a successful match, ``group()`` returns
that part of the string that matches the pattern.

The match only works from the start of the string.

::

   >>> re.match('ell','hello') == None
   True

To look for the first occurrence of the pattern
in the string, we have to use ``search()``, as below.

::

   >>> re.search('ell','hello')
   <re.Match object; span=(1, 4), match='ell'>
   >>> _.group()
   'ell'
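
A match object also records where in the string the match
was found.  The sketch below contrasts ``match()`` and ``search()``
on the same string and uses the ``span()`` method of the match object
to report the position.

::

   import re

   text = 'hello'
   at_start = re.match('ell', text)     # None: text does not start with 'ell'
   anywhere = re.search('ell', text)    # finds the first occurrence
   print(at_start)
   print(anywhere.span())               # start and end index of the match
   (begin, end) = anywhere.span()
   print(text[begin:end])               # same as anywhere.group()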

Regular Expressions
-------------------

A regular expression defines a pattern to match a string against.
The simplest patterns are introduced in :numref:`tabliteralor`.

.. _tabliteralor:

.. table:: literals and the or symbol

   +-------------+------------------------------------------+
   | pattern     | strings matched                          |
   +=============+==========================================+
   | ``literal`` | strings starting with ``literal``        |
   +-------------+------------------------------------------+
   | ``re1|re2`` | strings starting with ``re1`` or ``re2`` |
   +-------------+------------------------------------------+

Consider the following session.

::

   >>> from time import ctime
   >>> now = ctime()
   >>> now
   'Wed Jan 27 09:43:50 2016'
   >>> p = r'\w{3}\s\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s\d{4}'
   >>> re.match(p,now) != None
   True

The symbols used in the pattern ``p`` are explained
in :numref:`tabalphadigits`.

.. _tabalphadigits:

.. table:: matching characters and digits

   +-----------+--------------------------------------------------+
   | pattern   | strings matched                                  |
   +===========+==================================================+
   | ``\w``    | word character, for ASCII ``[A-Za-z0-9_]``       |
   +-----------+--------------------------------------------------+
   | ``\d``    | any decimal digit, same as ``[0-9]``             |
   +-----------+--------------------------------------------------+
   | ``\s``    | any whitespace character                         |
   +-----------+--------------------------------------------------+
   | ``re{n}`` | ``n`` occurrences of ``re``                      |
   +-----------+--------------------------------------------------+
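
One caveat: ``ctime()`` pads a single-digit day with a space,
so a pattern that expects exactly two digits for the day does not
match during the first nine days of a month.
The sketch below shows one possible adjustment on two fixed strings;
the pattern ``q`` and the second date string are constructed
for illustration.

::

   import re

   p = r'\w{3}\s\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s\d{4}'
   q = r'\w{3}\s\w{3}\s[\s\d]\d\s\d{2}:\d{2}:\d{2}\s\d{4}'

   late = 'Wed Jan 27 09:43:50 2016'
   early = 'Thu Jan  7 09:43:50 2016'   # day padded with a space

   for pattern in (p, q):
       print(re.match(pattern, late) is not None,
             re.match(pattern, early) is not None)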

We can match 0 or 1 occurrences.
Consider the following example.
We allow ``Ms.``, ``Mr.``, ``Mrs.``, with or without the ``.`` (dot).

::

   >>> title = r'Mr?s?\.? '

The raw string ``title`` contains the symbols ``?`` and ``\.``;
note the difference between ``.`` and ``\.``:

* ``?`` matches 0 or 1 occurrences

* ``.`` matches any character

* ``\.`` matches the dot ``.``

Consider the following examples.

::

   >>> re.match(title,'Ms ') != None
   True
   >>> re.match(title,'Ms. ') != None
   True
   >>> re.match(title,'Miss ') != None
   False
   >>> re.match(title,'Mr') != None
   False
   >>> re.match(title,'Mr ') != None
   True
   >>> re.match(title,'M ') != None
   True

We can match specific characters.
A name has to start with an upper case letter; consider

::

   >>> name = '[A-Z][a-z]*'
   >>> G = 'Guido van Rossum'
   >>> re.match(name,G)
   <re.Match object; span=(0, 5), match='Guido'>
   >>> _.group()
   'Guido'
   >>> g = 'guido'
   >>> re.match(name,g) == None
   True

Groups of regular expressions are designated
with parentheses, between ``(`` and ``)``.
The syntax is as follows.

::

   < pattern > = ( < group1 > ) ( < group2 > )
   m = re.match( < pattern > , < string > )
   if m != None: m.groups()

After a successful match, ``groups()`` returns a tuple of
those parts of the string that matched the pattern.
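
Parentheses also allow ``?`` to apply to a whole group.
With a group and a character class, the four acceptable
yes answers ``y``, ``yes``, ``Y``, and ``Yes`` fit in one pattern.
The sketch below uses ``re.fullmatch()``, which requires the
entire string to match; the helper name ``is_yes`` is made up
for illustration.

::

   import re

   def is_yes(answer):
       """
       Returns True if answer is one of
       y, Y, yes, or Yes.
       """
       return re.fullmatch(r'[yY](es)?', answer) is not None

   for answer in ['y', 'Y', 'yes', 'Yes', 'yeah', 'no']:
       print(answer, is_yes(answer))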

As an application, consider the extraction of
hours, minutes, and seconds,
using the ``groups()`` method:

::

   >>> import re
   >>> from time import ctime
   >>> now = ctime()
   >>> now
   'Wed Jan 27 09:51:28 2016'
   >>> t = now.split(' ')[3]
   >>> t
   '09:51:28'
   >>> fmt = r'(\d\d):(\d\d):(\d\d)'
   >>> m = re.match(fmt, t)
   >>> m.groups()
   ('09', '51', '28')
   >>> (hours, minutes, seconds) = _
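
The groups are strings.  To compute with them, they must be
converted to integers first.  A small sketch, on a fixed time
string instead of the current time so that the result is
reproducible:

::

   import re

   t = '09:51:28'
   m = re.match(r'(\d\d):(\d\d):(\d\d)', t)
   (hours, minutes, seconds) = (int(g) for g in m.groups())
   since_midnight = hours * 3600 + minutes * 60 + seconds
   print(since_midnight)   # number of seconds since midnight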

Exercises
---------

1. Words may contain ``\n`` or other special symbols.
   Modify the code to first strip the word of special symbols
   and to convert to lower case before updating the dictionary.

2. Modify the script ``wordswardict.py`` to count letters
   instead of words.

3. Download two different texts by each of two different authors
   from <http://www.gutenberg.org>.  Do the word frequencies
   tell which texts are written by the same author?

4. Write a regular expression to match all words that start
   with ``a`` and end with ``t``.

5. Modify ``wordswardict.py`` so that it prompts the user for a 
   regular expression and then builds a frequency table for those 
   words that match the regular expression.