{ "cells": [ { "cell_type": "markdown", "id": "92998f61-7c27-49f7-88b3-b487033a0ed9", "metadata": {}, "source": [ "We illustrate how to process files line by line or byte by byte." ] }, { "cell_type": "markdown", "id": "f54652df-03e8-40d8-b2bd-a5b3ad2b35d7", "metadata": {}, "source": [ "# 1. Reading Line by Line or Byte by Byte" ] }, { "cell_type": "markdown", "id": "3dd0179c-6e65-4587-b4af-606dabbf0e47", "metadata": {}, "source": [ "Consider the file `books.txt`, opened for reading (mode `r`) with the `utf-8` encoding." ] }, { "cell_type": "code", "execution_count": 1, "id": "f4bd3716-2afc-4eef-8b3c-7b2d4e87ea03", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:1:The Art & Craft of Computing:\n", "1:2:Making Use of Python:\n" ] } ], "source": [ "with open('books.txt', 'r', encoding='utf-8') as lib:\n", "    while True:\n", "        line = lib.readline()\n", "        if line == '':\n", "            break\n", "        print(line, end='')" ] }, { "cell_type": "markdown", "id": "53618a5c-aaec-4817-97e2-53a014798de0", "metadata": {}, "source": [ "Observe that `end=''` suppresses the newline that `print()` would otherwise add: each line read from the file already ends with a newline, so the newlines would be doubled without it. The loop stops when `readline()` returns the empty string, which signals the end of the file." ] }, { "cell_type": "markdown", "id": "ae573ee8-5ace-4ff4-966e-24d545145868", "metadata": {}, "source": [ "Consider the file `books.txt`, opened as a binary file `b`, for reading `r`."
] }, { "cell_type": "code", "execution_count": 2, "id": "ee65e357-77b6-4eeb-9015-43ad1f8ff893", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0:1:The Art & Craft of Computing:\n", "1:2:Making Use of Python:\n" ] } ], "source": [ "with open('books.txt', 'rb') as lib:\n", "    while True:\n", "        letter = lib.read(1).decode('utf-8')\n", "        if letter == '':\n", "            break\n", "        print(letter, end='')" ] }, { "cell_type": "markdown", "id": "d3c4168b-0713-4185-a861-28fc2b3526d9", "metadata": {}, "source": [ "Observe the decoding of the byte into a letter, immediately after reading. Decoding one byte at a time works here because every character in `books.txt` is encoded as a single byte in UTF-8; a character that takes several bytes would have to be read completely before it can be decoded." ] }, { "cell_type": "markdown", "id": "747e3701-dd5b-409c-9548-8b967a557b35", "metadata": {}, "source": [ "# 2. Processing Line by Line" ] }, { "cell_type": "markdown", "id": "76e4b26e-6317-43b6-9155-531f9d11da19", "metadata": {}, "source": [ "In the file `books.txt`, each book is stored as a string of fields separated by `:`, which is a rather unconventional form. Let us convert each line into a list instead.\n", "\n", "The list representation of each book consists of\n", "\n", "1. a boolean (available or not),\n", "\n", "2. the index, and\n", "\n", "3. the title." ] }, { "cell_type": "code", "execution_count": 3, "id": "b0ef6d26-26fb-412b-b6d8-280f3a3aab82", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[False, 1, 'The Art & Craft of Computing']\n", "[True, 2, 'Making Use of Python']\n" ] } ], "source": [ "with open('books.txt', 'r', encoding='utf-8') as lib:\n", "    while True:\n", "        line = lib.readline()\n", "        if line == '':\n", "            break\n", "        data = line.split(':')\n", "        print([data[0] == '1', int(data[1]), data[2]])" ] }, { "cell_type": "markdown", "id": "ac0dd68b-f92c-48fa-8269-f52131928392", "metadata": {}, "source": [ "Now that the output on screen looks good, we will write to a new file `bookslist.txt`."
] }, { "cell_type": "code", "execution_count": 4, "id": "af0b6fae-6a3a-4472-9f98-df01b6fa6f2b", "metadata": {}, "outputs": [], "source": [ "with open('books.txt', 'r', encoding='utf-8') as lib:\n", "    with open('bookslist.txt', 'w', encoding='utf-8') as newlib:\n", "        while True:\n", "            line = lib.readline()\n", "            if line == '':\n", "                break\n", "            data = line.split(':')\n", "            newd = str([data[0] == '1', int(data[1]), data[2]])\n", "            newlib.write(newd + '\\n')" ] }, { "cell_type": "markdown", "id": "b57291cd-63d3-46c7-8516-5c46a529aa50", "metadata": {}, "source": [ "Observe that the `print()` of the list is replaced by the `str()` function, which returns the string representation of the list, so that it can be written to the file." ] }, { "cell_type": "code", "execution_count": 5, "id": "dd9742a7-8200-40b4-9b02-0783fa9d27fa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"[False, 1, 'The Art & Craft of Computing']\\n\", \"[True, 2, 'Making Use of Python']\\n\"]\n" ] } ], "source": [ "with open('bookslist.txt', 'r', encoding='utf-8') as file:\n", "    print(file.readlines())" ] }, { "cell_type": "markdown", "id": "7bbd577d-03ce-440d-bab2-9379eaf4980f", "metadata": {}, "source": [ "To parse the string representation back into a list, we use the `literal_eval` function of the `ast` module."
] }, { "cell_type": "code", "execution_count": 6, "id": "193759d2-546b-48f8-b60e-9f033e21a807", "metadata": {}, "outputs": [], "source": [ "from ast import literal_eval" ] }, { "cell_type": "code", "execution_count": 7, "id": "b5cc8375-6562-4518-82fd-b43dc35c0a42", "metadata": {}, "outputs": [], "source": [ "sL = '[1, 2]'" ] }, { "cell_type": "code", "execution_count": 8, "id": "3d73e80b-5d29-4032-b2b0-27f2615867cc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(sL)" ] }, { "cell_type": "code", "execution_count": 9, "id": "11eab0ca-aaeb-4789-a4c9-b8d9b014c87b", "metadata": {}, "outputs": [], "source": [ "L = literal_eval(sL)" ] }, { "cell_type": "code", "execution_count": 10, "id": "7defa3f6-81ff-40bd-92fc-11b0903a5606", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L" ] }, { "cell_type": "code", "execution_count": 11, "id": "f28bc86f-4090-4c1c-b4f5-59849d233f0a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(L)" ] }, { "cell_type": "code", "execution_count": 12, "id": "5af63aa5-840f-448c-8eec-ce901e25998b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[False, 1, 'The Art & Craft of Computing'] has type <class 'list'>\n", "[True, 2, 'Making Use of Python'] has type <class 'list'>\n" ] } ], "source": [ "with open('bookslist.txt', 'r', encoding='utf-8') as lib:\n", "    while True:\n", "        line = lib.readline()\n", "        if line == '':\n", "            break\n", "        data = literal_eval(line)\n", "        print(data, 'has type', type(data))" ] }, { "cell_type": "markdown", "id": "15e6d5bf-5c49-4f91-8df2-654c0ada1068", "metadata": {}, "source": [ "# 3. 
Processing Character by Character" ] }, { "cell_type": "markdown", "id": "55b9308b-3c8a-4d5f-becb-3a878a67cb86", "metadata": {}, "source": [ "To encrypt a text, we can replace letters by other letters. Let us scramble the vowels of the text in the file `sometext.txt`." ] }, { "cell_type": "code", "execution_count": 13, "id": "ce4694f0-8701-4487-a121-53868a4e3f93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a sample text, used as an example\n", "for a message whose vowels will be scrambled.\n" ] } ], "source": [ "with open('sometext.txt', 'r', encoding='utf-8') as file:\n", "    while True:\n", "        letter = file.read(1)\n", "        if letter == '':\n", "            break\n", "        print(letter, end='')" ] }, { "cell_type": "markdown", "id": "256e656e-bb51-4be4-88c1-71ff90b741b4", "metadata": {}, "source": [ "To scramble the text, we use a dictionary. Each key is a vowel and its value is the letter that replaces it. Only lowercase vowels are keys, so every other character is copied unchanged." ] }, { "cell_type": "code", "execution_count": 14, "id": "8d7b1241-8b90-4b97-a3cd-cffd0e6dce06", "metadata": {}, "outputs": [], "source": [ "D = {'a':'e', 'e':'u', 'i':'o', 'o':'a', 'u':'i'}" ] }, { "cell_type": "code", "execution_count": 15, "id": "e201d2a1-0e00-44d9-9568-79d32a25da9b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Thos os e semplu tuxt, isud es en uxemplu\n", "far e mussegu whasu vawuls woll bu scremblud.\n" ] } ], "source": [ "with open('sometext.txt', 'r', encoding='utf-8') as file:\n", "    while True:\n", "        letter = file.read(1)\n", "        if letter == '':\n", "            break\n", "        if letter not in D:\n", "            print(letter, end='')\n", "        else:\n", "            print(D[letter], end='')" ] }, { "cell_type": "markdown", "id": "eb2ec01b-156e-4cb6-b03b-1a54a00b8615", "metadata": {}, "source": [ "Now that we see the correct output, we can confidently write to a new file, called `codetext.txt`."
] }, { "cell_type": "code", "execution_count": 16, "id": "256d8fb6-111c-47a5-b1d5-d722abd8ec53", "metadata": {}, "outputs": [], "source": [ "with open('sometext.txt', 'r', encoding='utf-8') as infile:\n", "    with open('codetext.txt', 'w', encoding='utf-8') as outfile:\n", "        while True:\n", "            letter = infile.read(1)\n", "            if letter == '':\n", "                break\n", "            if letter not in D:\n", "                outfile.write(letter)\n", "            else:\n", "                outfile.write(D[letter])" ] }, { "cell_type": "code", "execution_count": 17, "id": "d3bf7802-9e3f-4ede-8fcf-4e968f6bd720", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Thos os e semplu tuxt, isud es en uxemplu\\n', 'far e mussegu whasu vawuls woll bu scremblud.\\n']\n" ] } ], "source": [ "with open('codetext.txt', 'r', encoding='utf-8') as file:\n", "    print(file.readlines())" ] }, { "cell_type": "markdown", "id": "22b47144-adc2-4a4f-887f-6a4f6f153b5f", "metadata": {}, "source": [ "# 4. Using a Buffer to Process a File" ] }, { "cell_type": "markdown", "id": "6050dcbd-f857-47af-838a-90d7d867c93b", "metadata": {}, "source": [ "If we read a text file letter after letter in search of a word, we need to buffer the letters read. If the word we are looking for, such as `the`, has three letters, then the size of the buffer is three." ] }, { "cell_type": "code", "execution_count": 18, "id": "12cff7b6-c8d3-4d86-8a0e-8bece22e5ede", "metadata": {}, "outputs": [], "source": [ "def add_to_buffer(buf, let):\n", "    \"\"\"\n", "    Adds let to the 3-letter buffer buf.\n", "    The buffer is updated in place and returned.\n", "    \"\"\"\n", "    nbf = buf  # nbf refers to the same list as buf\n", "    if len(buf) == 3:\n", "        (nbf[0], nbf[1], nbf[2]) = (nbf[1], nbf[2], let)\n", "    else:\n", "        nbf.append(let)\n", "    return nbf" ] }, { "cell_type": "markdown", "id": "41669952-891c-41c9-975a-b0e33f96faff", "metadata": {}, "source": [ "The buffer is a list. Once it holds three letters, a tuple assignment shifts the letters one position to the left and places the new letter at the end."
] }, { "cell_type": "code", "execution_count": 19, "id": "e313d334-a69b-4db6-b92a-c2bc4ac0f155", "metadata": {}, "outputs": [], "source": [ "word = ['t', 'h', 'e']" ] }, { "cell_type": "code", "execution_count": 20, "id": "b4c855ea-ee82-4c99-9716-3737b458b04c", "metadata": {}, "outputs": [], "source": [ "buffer = []" ] }, { "cell_type": "code", "execution_count": 21, "id": "6afb6672-512e-4be2-b599-168d3e48926b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At the end of the lecture we go home,\n", "taking the bus or the train.\n" ] } ], "source": [ "count = 0\n", "with open('sometext1.txt', 'r', encoding='utf-8') as file:\n", "    while True:\n", "        letter = file.read(1)\n", "        if letter == '':\n", "            break\n", "        print(letter, end='')\n", "        buffer = add_to_buffer(buffer, letter)\n", "        if buffer == word:\n", "            count = count + 1" ] }, { "cell_type": "code", "execution_count": 22, "id": "087f2500-6f15-4c23-a22c-cd5d6d749aa2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count" ] }, { "cell_type": "markdown", "id": "c3f7d243-21f0-465c-8971-a33a5b78c072", "metadata": {}, "source": [ "# 5. Replacing a Word on File" ] }, { "cell_type": "markdown", "id": "66bec727-5c24-479a-ae48-c81bf9ed598f", "metadata": {}, "source": [ "The previous code to count the number of occurrences of `the` will be used to replace every occurrence of `the` by `one`. Because `one` takes as many bytes as `the`, each occurrence can be overwritten in place. For this replacement to work:\n", "\n", "1. The file must be opened for reading and writing in binary mode, `r+b`, because a text file does not support seeking backwards relative to the current position.\n", "\n", "2. Bytes read from file must be decoded into strings.\n", "\n", "3. Strings written to file must be encoded into bytes." ] }, { "cell_type": "markdown", "id": "d2c6acac-f537-4174-9afc-ae1d810294fb", "metadata": {}, "source": [ "Let us copy `sometext1.txt` into `sometext2.txt` first, because we will change `sometext2.txt`."
] }, { "cell_type": "code", "execution_count": 23, "id": "0c23e9c3-72ef-4bab-b0c7-0bf3dfb5a168", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At the end of the lecture we go home,\n", "taking the bus or the train.\n" ] } ], "source": [ "with open('sometext1.txt', 'r', encoding='utf-8') as infile:\n", "    with open('sometext2.txt', 'w', encoding='utf-8') as outfile:\n", "        while True:\n", "            letter = infile.read(1)\n", "            if letter == '':\n", "                break\n", "            print(letter, end='')\n", "            outfile.write(letter)" ] }, { "cell_type": "code", "execution_count": 24, "id": "c6d93d5c-7ac1-4a1c-a389-b233a0d96462", "metadata": {}, "outputs": [], "source": [ "buffer = []  # discard the letters left over from the previous search\n", "count = 0  # restart the count for this file\n", "with open('sometext2.txt', 'r+b') as file:\n", "    while True:\n", "        letter = file.read(1).decode('utf-8')\n", "        if letter == '':\n", "            break\n", "        buffer = add_to_buffer(buffer, letter)\n", "        if buffer == word:\n", "            file.seek(-3, 1)  # move back to the start of the word\n", "            file.write(bytes('one', 'utf-8'))\n", "            count = count + 1" ] }, { "cell_type": "markdown", "id": "a0035648-ee7c-44dd-8567-33456cbb6eae", "metadata": {}, "source": [ "Let us check the content of `sometext2.txt`." ] }, { "cell_type": "code", "execution_count": 25, "id": "8d80efa7-5ae9-4024-a359-bf2869fba038", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['At one end of one lecture we go home,\\n', 'taking one bus or one train.\\n']\n" ] } ], "source": [ "with open('sometext2.txt', 'r', encoding='utf-8') as file:\n", "    print(file.readlines())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 5 }