Web Clients and Crawlers
========================

Web Clients
-----------

We do not really need Apache to host a web service.
The client is a browser, e.g.: Netscape, Firefox, ...
We can browse the web using scripts, without a browser..
Why do we want to do this?

1. *more efficient*, without overhead from the GUI,

2. *in control*, we request only what we need,

3. *crawl* the web, conduct a recursive search.

Python provides the ``urllib`` and ``urlparse`` modules.

For example, consider the retrieval of the weather forecast for Chicago.

::

   $ python3 forecast.py

   opening http://tgftp.nws.noaa.gov/data/forecasts/state/il/ilz013.txt ...

      Today    Sat      Sun      Mon      Tue      Wed      Thu      
      Apr 07   Apr 08   Apr 09   Apr 10   Apr 11   Apr 12   Apr 13   

      Chicago Downtown
      SUNNY    SUNNY    MOCLDY   PTCLDY   PTCLDY   SUNNY    MOCLDY   
        /50    38/67    54/72    56/68    45/54    40/48    42/59    
         /00    00/00    00/10    30/30    30/10    10/00    20/40   

      Chicago O'hare
      SUNNY    SUNNY    MOCLDY   PTCLDY   PTCLDY   SUNNY    MOCLDY   
        /55    37/67    54/73    57/69    44/56    39/56    41/60    
         /00    00/00    10/10    40/40    30/10    10/00    30/40   

The script ``forecast.py`` is listed below.

::

   from urllib.request import urlopen
   HOST = 'http://tgftp.nws.noaa.gov'
   FCST = '/data/forecasts/state'
   URL = HOST + FCST + '/il/ilz013.txt'
   print('opening ' + URL + ' ...\n')
   DATA = urlopen(URL)
   while True:
       LINE = DATA.readline().decode()
       if LINE == '':
           break
       L = LINE.split(' ')
       if 'FCST' in L:
           LINE = DATA.readline().decode()
           print(LINE + DATA.readline().decode())
       if 'Chicago' in L:
           LINE = LINE + DATA.readline().decode()
           LINE = LINE + DATA.readline().decode()
           print(LINE + DATA.readline().decode())

The processing of a web page is similar to processing a file.
As example, consider the copying of a web page to a file.
The syntax of the ``urlretrieve`` is

::

   urlretrieve( < URL >, < file name > )

For example,

::

   from urllib.request import urlretrieve
   urlretrieve('http://www.python.org','wpt.html')

The above statements copy the page at ``http://www.python.org``
to the file ``wpt.html``.
To practice the tools to browse web pages with a script,
we will do the same as ``urlretrieve`` does,
reading the page in small increment.

First, we open a web page with ``urllib.request.urlopen``,
its syntax is below.

::

   from urllib.request import urlopen
   < object like file > = urlopen( < URL > )

Then we read data, with the application of ``read`` method:

::

   data = < object like file >.read( < size > ).decode()

where ``size`` is the number of bytes in the buffer.
The ``read`` returns a sequence of bytes.
To turn the sequence of bytes into a string,
we have to apply the ``decode()`` method.
After reading, we close the page:

::

  < object like file >.close()

The ``open``, ``read``, ``close`` methods
are similar to the methods on files.

The opening of a web page is surrounded by an exception handler
in the code below.

::

   def main():
       """
       Prompts the user for a web page,
       a file name, and then starts copying.
       """
       from urllib.request import urlopen
       print('making a local copy of a web page')
       url = input('Give URL : ')
       try:
           page = urlopen(url)
       except:
           print('Could not open the page.')
           return
       name = input('Give file name : ')
       copypage(page, name)

The function to copy a web page to a file is defined below.

::

   def copypage(page, file):
       """
       Given the URL for the web page,
       a copy of its contents is written to file.
       Both url and file are strings.
       """
       copyfile = open(file, 'w')
       while True:
           try:
               data = page.read(80).decode()
           except:
               print('Could not decode data.')
               break
           if data == '':
               break
           copyfile.write(data)
       page.close()
       copyfile.close()

Scanning Files
--------------

The web pages we download are formatted HTML files.
Applications to scan an HTML file are for example:

1. search for particular information,
   for example download all ``.py`` files from the course web site;

2. navigate to where the page refers to,
   for example, retrieve all URLs the page ``www.python.org``
   refers to.

What is common between these two examples:
``.py`` files and URLs appear between
double quotes in the files.
So we will scan a file for all strings between double quotes.
The problem statement is

*  Input: a file, or object like a file.

*  Output:  list of all strings between double quotes.

Recall that we read files with fixed size buffer,
as illustrated in :numref:`figfixedbuffer`.

.. _figfixedbuffer:

.. figure:: ./figfixedbuffer.png
    :align: center

    Reading files with a buffer of fixed size.

For double quoted strings which run across two buffers
we need another buffer.
So we have to manage two buffers,
one for reading strings from file, and 
another for buffering double quoted string.
We will have two functions,
one to read buffered data from file, and
another to scan the data buffer for double quoted strings.

The code to read strings from file is listed below.

::

   def quoted_strings(file):
       """
       Given a file object, this function scans
       the file and returns a list of all strings
       on the file enclosed between double quotes.
       """
       result = []
       buffer = ''
       while True:
           data = file.read(80)
           if data == '':
               break
           (result, buffer) = update_qstrings(result, buffer, data)
       return result


We perform a buffered reading of the file.
In ``acc`` we store the double quoted strings.
In ``buf`` we buffer the double quoted strings.
In :numref:`figprocessbuffer`, every dot ``.`` represents a character.

.. _figprocessbuffer:

.. figure:: ./figprocessbuffer.png
    :align: center

    Processing strings with two buffers.

In ``quoted_strings`` we make the following call:

::

   (result, buffer) = update_qstrings(result, buffer, data)

Code for the function is listed below.

::

   def update_qstrings(acc, buf, data):
       """
       acc is a list of double quoted strings,
       buf buffers a double quoted string, and
       data is the string to be processed.
       Returns an updated (acc, buf).
       """
       newbuf = buf
       for char in data:
           if newbuf == '':
               if char == '\"':
                   newbuf = 'o' # 'o' is for 'opened'
           else:
               if char != '\"':
                   newbuf += char
               else:       # do not store 'o'
                   acc.append(newbuf[1:len(newbuf)])
                   newbuf = ''
       return (acc, newbuf)

The function ``main()`` is defined below.

::

   def main():
       """
       Prompts the user for a file name and
       scans the file for double quoted strings.
       """
       print('getting double quoted strings')
       name = input('Give a file name : ')
       file = open(name, 'r')
       strs = quoted_strings(file)
       print(strs)
       file.close()

Recall the second example application:
list all URLs referred to at ``http://www.python.org``
so we need to scan the web pages for URLs.

::

   def main():
       """
       Prompts the user for a web page,
       and prints all URLs this page refers to.
       """
       print('listing reachable locations')
       page = input('Give URL : ')
       links = httplinks(page)
       print('found %d HTTP links' % len(links))
       show_locations(links)

The filtering of double quoted strings 
and extracting the URLs starts with the function below.

::

   from scanquotes import update_qstrings

   def httpfilter(strings):
       """
       Returns from the list strings only
       those strings which begin with http.
       """
       result = []
       for name in strings:
           if len(name) > 4:
               if name[0:4] == 'http':
                   result.append(name)
       return result

In the function ``httplinks``, we first open the URL
and then we read that page in search for double quoted strings.

::

   def httplinks(url):
       """
       Given the URL for the web page,
       returns the list of all http strings.
       """
       from urllib.request import urlopen
       try:
           print('opening ' + url + ' ...')
           page = urlopen(url)
       except:
           print('opening ' + url + ' failed')
           return []

       (result, buf) = ([], '')
       while True:
           try:
               data = page.read(80).decode()
           except:
               print('could not decode data')
               break
           if data == '':
               break
           (result, buf) = update_qstrings(result, buf, data)
           result = httpfilter(result)
       page.close()
       return result

An URL consists of 6 parts

::

   protocol://location/path:parameters?query#frag 

Given a URL ``u``, the ``urlparse(u)`` returns a 6-tuple.

::

   def show_locations(links):
       """
       Shows the locations of the URLs in links.
       """
       from urllib.parse import urlparse
       for url in links:
           pieces = urlparse(url)
           print(pieces[1])

Web Crawlers
------------

Web crawlers make requests recursively.
We scang HTML files and browse as follows:

1. given a URL, open a web page,

2. compute the list of all URLs in the page,

3. for all URLs in the list do:

   1. open the web page defined by location of URL,

   2. compute the list of all URLs on that page.

   then continue recursively, *crawling* the web.

Some things we have to consider:

1. remove duplicates from list of URLs,

2. do not turn back to pages visited before,

3. limit the levels of recursion,

4. some links will not work.

This is very similar to finding a path in a maze,
but now we are interested in all intermediate 
nodes along the path.

The running of the crawler is illustrated below.

::

   $ python webcrawler.py
   crawling the web ...
   Give URL : http://www.uic.edu
   give maximal depth : 2
   opening http://www.uic.edu ...
   opening http://maps.uic.edu ...
   could not decode data
   opening http://maps.google.com ...
   opening http://maps.googleapis.com ...
   opening http://maps.googleapis.com failed
   opening http://fimweb.fim.uic.edu ...

   .. it takes a while .. 

   total #locations : 3954  

In 2010: ``#locations : 538``

In a modular design of the code for the crawler,
we start with

::

   from scanhttplinks import httplinks

We still are left to write:
code to manage the list of server locations, and
the recursive function to crawl the web.

To retain only new Locations, we filter the list
of links with the function defined below.

::

   def new_locations(links, visited):
       """
       Given the list links of new URLs and the
       list of already visited locations,
       returns the list of new locations,
       locations not yet visited earlier.
       """
       from urllib.parse import urlparse
       result = []
       for url in links:
           parsed = urlparse(url)
           loc = parsed[1]
           if loc not in visited:
               if loc not in visited:
                   result.append(loc)
       return result

Recall that we store only the server locations.
To open a web page we also need to specify the protocol.
We apply ``urlparse.urlunparse`` as follows:

::

   >>> from urlparse import urlunparse
   >>> urlunparse(('http','www.python.org',
   ... '','','',''))
   'http://www.python.org'

We must provide a 6-tuple as argument.

The function ``main()`` is defined below.

::

   def main():
       """
       Prompts the user for a web page,
       and prints all URLs this page refers to.
       """
       print('crawling the web ...')
       page = input('Give URL : ')
       depth = int(input('give maximal depth : '))
       locations = crawler(page, depth, [])
       print('reachable locations :', locations)
       print('total #locations :', len(locations))

The code for the crawler is provided in the function below.

::

   def crawler(url, k, visited):
       """
       Returns the list visited updated with the
       list of locations reachable from the
       given url using at most k steps.
       """
       from urllib.parse import urlunparse
       links = httplinks(url)
       newlinks = new_locations(links, visited)
       result = visited + newlinks
       if k == 0:
           return result
       else:
           for loc in newlinks:
               url = urlunparse(('http', loc, '', '', '', ''))
               result = crawler(url, k-1, result)
           return result

Exercises
---------

1. Write a script to download all ``.py`` files from
   the course web site.

2. Limit the search of the crawler so that it only opens
   pages within the same domain.  For example, if we
   start at a location ending with ``edu``, we only open
   pages with locations ending with ``edu``.

3. Adjust ``webcrawler.py`` to search for a path
   between two locations.  The user is prompted for two URLs.
   Crawling stops if a path has been found.

4. Write an iterative version for the web crawler.

5. Use the stack in the iterative version of the crawler from
   the previous exercise to define a tree of all locations that
   can be reached from a given URL.