Sep 18

Methods

In addition to functions, python has methods. We will talk more in detail about methods in a couple months when we get to object-orientated programming. For now, just the basics. A method is like a function in that it has a list of parameters, a body, a return value, and can be called. But where a function is stored in a variable, a method is attached to a data value. Here is an example method from the python standard library:

x = "Hello, World"
print(x.upper())

This uses the upper method on strings, which returns a copy of the string except with lower case letters converted to upper case. The syntax of calling looks similar to using a function from a module, but they are actually different. When python sees an expression like foo.bar(3), it looks up what foo is. If it is a module name, then it looks in the module for a function bar and then calls it. If foo is a variable holding a data value, python looks for a method bar attached to this data value. These are closely related but slightly different. Methods can also take parameters, for example replace.

x = "Hello, World"
print(x.replace("W", "abcd"))

The python standard library has a large collection of methods, so if you find yourself needing to transform or manipulate a data value, you should check the python documentation to see if there is a method that does that already. For now we will just use these methods; we will learn how to define our own methods in a few months.

Dictionaries

Frequency Analysis

Frequency Analysis is the technique to breaking the subsititution cipher. To do frequency analysis, we will build up several tools. Each tool will be a python function.

The first tool we need to a function to count up the number of times each letter occurs.

def letter_count(text):
    text = text.lower()
    alphabet = "abcdefghijklmnopqrstuvwxyz"

    # The frequencies will be stored in a dictionary with the letter as the key
    # and the number of occurences as the value.
    l_count = {}

    for c in text:
        if c in alphabet:
            # Retrieve the old count for this character.  The get
            # method will return the value if the key exists, otherwise
            # return zero.
            oldcount = l_count.get(c, 0)

            # Now store the oldcount + 1 as the new count
            l_count[c] = oldcount + 1

    return l_count

# Some Test Code
print(letter_count("The Magic Words are Squeamish Ossifrage"))

The above code computes a dictionary. We now want to sort this by the count, so we can start guessing that the high frequency letters are e, t, and so on. See the python HOWTO on sorting. We will use the sorted function. The sorted function takes a list and a comparison function. First, we use the items method on dictionaries. This method converts the dictionary to a list of pairs, each pair is one (key, value). This list we will pass to the sorted function. Next we need a key extraction function. This is a function which is used by the sorted function to determine the correct ordering in the list. It must take a single parameter and return a value (called the key). Python uses this key value to determine the sort order.

def extract_key(a):
    # a is a tuple of size two, with the first entry the key and the
    # second entry the value.  Therefore, a[1] is the key.
    return a[1]

def sort_counts(l_count):
    l_list = l_count.items()
    return sorted(l_list, key=extract_key)

# Some test code
l_count = letter_count("The Magic Words are Squeamish Ossifrage")
print(l_count.items())  # Just to see the list of items
print(sort_counts(l_count)) # The list should now be sorted

Now we are going to make guesses at which letters substitute for which other letters. To do that, we will build a dictionary with our guesses and a function which will replace our guesses.

def try_decrypt(guesses, ciphertext):
    plaintext = ""
    for c in ciphertext:
    
        # This checks if c is a key within the guesses dictionary
        if c in guesses:
            plaintext += guesses[c]
        else:
            # Just copy it directly
            plaintext += c
    return plaintext

# Some test code
guesses = {"s": "W", "t": "C"}
print(try_decrypt(guesses, "test test"))

An Example

Here is a message I encrypted using the code from Day 12.

rnkcnuoiwrzsxsvnuornkcnuovbwoxwqsdaroktozsrrslwoiwfxdrovbsvow
ahmqfwxwiovbwonkcwxozsxvoktovbwoikkxcsuocsroxwqkewiocdvbovxwq
fndalobsairodoqsiwosovdauofxwshbodaovbwomzzwxonwtvobsaiohkxaw
xosaiovbwaocdiwadalovbwobknwosondvvnwododarwxvwiovbwohsainwos
aiozwwxwiodaovbwobkvosdxowrhszdalotxkqovbwohbsqfwxohsmrwiovbw
otnsqwovkotndhjwxofmvozxwrwavnuoiwvsdnroktovbwoxkkqocdvbdaowq
wxlwiotxkqovbwoqdrvooohsaoukmorwwosauvbdaloy

You can also download it as a file ciphertext.txt.

The first thing is to compute the frequencies. To do so, I created a python file with all the above functions (and none of the test code) and added at the bottom the following. The open function opens a file and the read method returns the contents of the file as a string.

ciphertext = open("ciphertext.txt", "r").read()
print(sort_counts(letter_count(ciphertext)))

From this, we see that o is the most frequent, followed by w, and then v. So lets guess that o is space, w is E, and v is T. Notice how I use capital letters to help me see which letters I have decoded and which are still ciphertext.

guesses = { "o" : " ", "w" : "E", "v" : "T"}
print(try_decrypt(guesses, ciphertext))

The result is

rnkcnu iErzsxsTnu rnkcnu TbE xEqsdar kt zsrrslE iEfxdr TbsT EahmqfExEi
TbE nkcEx zsxT kt TbE ikkxcsu csr xEqkeEi cdTb TxEqfndal bsair d qsiE s
Tdau fxEshb da TbE mzzEx nEtT bsai hkxaEx sai TbEa cdiEadal TbE bknE s
ndTTnE d darExTEi TbE hsainE sai zEExEi da TbE bkT sdx Erhszdal txkq TbE
hbsqfEx hsmrEi TbE tnsqE Tk tndhjEx fmT zxErEaTnu iETsdnr kt TbE xkkq
cdTbda EqExlEi txkq TbE qdrT   hsa ukm rEE sauTbdal y

My space guess looks very good since it looks somewhat like words. Also, the T and E look like very good guesses since you see TbE many times: lets guess that is THE.

guesses = { "o" : " ", "w" : "E", "v" : "T", "b" : "H"}
print(try_decrypt(guesses, ciphertext))
rnkcnu iErzsxsTnu rnkcnu THE xEqsdar kt zsrrslE iEfxdr THsT EahmqfExEi THE
nkcEx zsxT kt THE ikkxcsu csr xEqkeEi cdTH TxEqfndal Hsair d qsiE s Tdau
fxEshH da THE mzzEx nEtT Hsai hkxaEx sai THEa cdiEadal THE HknE s ndTTnE
d darExTEi THE hsainE sai zEExEi da THE HkT sdx Erhszdal txkq THE hHsqfEx
hsmrEi THE tnsqE Tk tndhjEx fmT zxErEaTnu iETsdnr kt THE xkkq cdTHda EqExlEi
txkq THE qdrT   hsa ukm rEE sauTHdal y

Lets go back to the frequencies. The next most used letters in the ciphertext are s,x,a,b,d, all roughly equal. After E, T, the next most frequent letters are A,H,I,N,O,R,S all roughly equal. So lets try several guesses for s,x,a,d,k. One thing that can help is that "s" appears as a word by itself so is probably either "A" or "I". Also, there is a word in the ciphertext "THsT" which suggests "s" is "A". Also, there is a word "THEa" which might let us guess "a" is "N".

guesses = { "o" : " ", "w" : "E", "v" : "T", "b" : "H", "s" : "A", "a" : "N"}
print(try_decrypt(guesses, ciphertext))
rnkcnu iErzAxATnu rnkcnu THE xEqAdNr kt zArrAlE iEfxdr THAT ENhmqfExEi THE
nkcEx zAxT kt THE ikkxcAu cAr xEqkeEi cdTH TxEqfndNl HANir d qAiE A TdNu
fxEAhH dN THE mzzEx nEtT HANi hkxNEx ANi THEN cdiENdNl THE HknE A ndTTnE
d dNrExTEi THE hANinE ANi zEExEi dN THE HkT Adx ErhAzdNl txkq THE hHAqfEx
hAmrEi THE tnAqE Tk tndhjEx fmT zxErENTnu iETAdnr kt THE xkkq cdTHdN EqExlEi
txkq THE qdrT   hAN ukm rEE ANuTHdNl y

Notice "d" appears as a word by itself so we guess d is I. From the frequencies, the high ciphertext letters we have not yet used are "x" and "k" and the plaintext letters are O, R, S. We see the ciphertext word "Tk" so might guess k is O. Next, we still need to guess "x". Looking at where x appears, we can see "AIx" so lets guess "x" is R (recall we guessed "x" was one of O, R, S and only R makes sense as a word).

guesses = { "o" : " ", "w" : "E", "v" : "T", "b" : "H", "s" : "A",
            "a" : "N", "d" : "I", "k" : "O", "x" : "R"}
print(try_decrypt(guesses, ciphertext))
rnOcnu iErzARATnu rnOcnu THE REqAINr Ot zArrAlE iEfRIr THAT ENhmqfEREi THE
nOcER zART Ot THE iOORcAu cAr REqOeEi cITH TREqfnINl HANir I qAiE A TINu
fREAhH IN THE mzzER nEtT HANi hORNER ANi THEN cIiENINl THE HOnE A nITTnE I
INrERTEi THE hANinE ANi zEEREi IN THE HOT AIR ErhAzINl tROq THE hHAqfER
hAmrEi THE tnAqE TO tnIhjER fmT zRErENTnu iETAInr Ot THE ROOq cITHIN EqERlEi
tROq THE qIrT   hAN uOm rEE ANuTHINl y

Now we can start seeing words, like "ANi" so guess i is D. Now look to the remaining two-letter words. "Ot". So guess t is F or N, but we used N already so guess F. Next, lets go back to the frequencies. The large frequency ciphertext letters we have not yet guessed are n, r, q. The plaintext letters we have not yet used are S, and then D, L. So lets guess one of n, r, or q is S. You can try all three, but also notice the word "INrERTED" so should guess r is S. This gives

guesses = { "o" : " ", "w" : "E", "v" : "T", "b" : "H", "s" : "A",
            "a" : "N", "d" : "I", "k" : "O",  "x" : "R",
            "i" : "D", "t" : "F", "r" : "S"}
print(try_decrypt(guesses, ciphertext))
SnOcnu DESzARATnu SnOcnu THE REqAINS OF zASSAlE DEfRIS THAT
ENhmqfERED THE nOcER zART OF THE DOORcAu cAS REqOeED cITH
TREqfnINl HANDS I qADE A TINu fREAhH IN THE mzzER nEFT HAND
hORNER AND THEN cIDENINl THE HOnE A nITTnE I INSERTED THE hANDnE
AND zEERED IN THE HOT AIR EShAzINl FROq THE hHAqfER hAmSED THE
FnAqE TO FnIhjER fmT zRESENTnu DETAInS OF THE ROOq cITHIN EqERlED
FROq THE qIST   hAN uOm SEE ANuTHINl y

From here you can keep guessing letters. Like "REqAINS" might cause you to guess q is M. "nEFT HAND" might cause you to guess n is L. Guesses like this lead you to the final text, which happens to be the Solution 3 from the Kryptos Statue. But the thing to take away is that we kept guiding our guesses by the frequencies. Also, you need to remember that perhaps some X could be a space above so in particular that list of three spaces in a row might be instead " X ". Or some of the "words" above might be a single word joined by X.

Exercises

Use the above technique to break the following message. I encrypted this message using the substitution cipher and a random key, just like the above example. Make sure you understand how you are using dictionaries and functions and don't just copy and paste the code without understanding.

zbuycrnpuycbvzzeuyhncynypiuhujyrniztneewyiuvcnfeuyemhyunizbymiftzneyc
rnpupindzymruinzujyfwyzbuyvcyxnztmxneynuimxnvztpcynxjycrnpuynjatxtczi
nztmxyyytzcymddtptneyrimsinayxnauyhncycrnpuyzinxcrmiznztmxycwczua,yzn
ouxydimaynyxtxuzuuxyctyzwyxtxuyrenxydmiynycwczuaymdyiuvcnfeuycrnpupin
dzymdyhbtpbytzyhncyzbuymxewytzuaydvxjujydmiyjukuemrauxzyyyzbuydticzym
dydmviymiftzneyzuczydetsbzcymppviiujytxyxtxuzuuxyutsbzwymxuyeunjtxsyz
mymruinztmxneydetsbzcyfustxxtxsytxyxtxuzuuxyutsbzwyzhmyyytzyhncyvcujy
mxynyzmzneymdymxuybvxjiujyzbtizwydtkuyatcctmxcydimayxtxuzuuxyutsbzwym
xuyzmyzhmyzbmvcnxjyueukuxyenvxpbujydimayzbuyouxxujwycrnpuypuxzuiyytxy
demitjnyyymruinztmxneyatcctmxcyenvxpbujyxvauimvcycnzueetzucytxzuirenx
uzniwyrimfucynxjyzbuybvffeuycrnpuyzueucpmruypmxjvpzujycptuxpuyuyruita
uxzcytxymiftzynxjyrniztptrnzujytxypmxczivpztmxynxjycuiktptxsymdyzbuyt
xzuixnztmxneycrnpuycznztmx

The linebreaks mean nothing, it should be all one string.