Part 3: Breaking Repeating Key XOR Programmatically

This is part three in a three part series on repeating key XOR encryption inspired by Cryptopals challenges five and six of set one. In part one I broke down how to understand and implement repeating key XOR. I'd recommend you start there. In part two I described the theory behind breaking repeating key XOR.

Part three (this part) describes how to write up the code behind part two. This will be geared specifically toward Cryptopals challenge six but the concepts apply generally.

As with part one and part two, to grok what follows you shouldn't need much. Basic encryption knowledge and some Python chops will help but if you don't have those, read on anyway, Google is your friend.

Disclaimer: All code is written in Python 3.

The Process

First, let's recall the general outline of cracking repeating key XOR.

  1. Guess key lengths
  2. Find the Hamming distance between KEYLENGTH chunks of text. The lowest overall distance is likely the key length.
  3. Transpose the original ciphertext into KEYLENGTH chunks grouped by key position.
  4. Use frequency analysis with the help of ETAOIN SHRDLU to find the key byte for each transposed chunk and combine the key bytes sequentially to get the key.
  5. Decrypt the ciphertext with the key.

Getting Started

In the Cryptopals challenge we're given a file and told that it is base64 encoded. Your ciphertext may not always be like this. For example, it could be an unencoded string of text. The principles remain the same but in this case we'll start by decoding it to raw binary data.

Here's how:

from base64 import b64decode

the_file = open('./6.txt', 'r')
data = the_file.read()

decoded = b64decode(data)
print(decoded)

Now that we've got our decoded binary data we can start building out the steps to decrypt it. The first thing we should nail down is...

Hamming Distance

If you recall, finding the KEYLENGTH that produces the shortest Hamming distance between two adjacent KEYLENGTH chunks of ciphertext is probably the true length of the key. The trick to finding Hamming distance is to use XOR (^ in Python).

What does XOR actually do?

XORing two things returns the number represented by all the differing bits between those two things. For example:

97 in bits is 0110 0001
98 in bits is 0110 0010
97 ^ 98    is 0000 0011 (aka the integer 3)

So, we see that 97 ^ 98 returns a value (3) that has 2 set bits which means the Hamming distance between 97 and 98 is 2.

In Python, binary operators (like XOR) work with numbers (like 36 ^ 47). If you want to XOR characters the easiest thing to do is use ord to convert them to numbers, like so: ord('a') ^ ord('b').

OK, how do we count the number of set bits? Bit manipulation.

Binary & and Right Shifting

We check if the right most bit in a byte is set (aka 1) by using binary AND (&)  and then chopping the right most bit off using right bit shifting (>>). Chopping it off allows us to easily check if the next bit in line is set.

In our example 97 ^ 98 = 3, which is 0000 0011 in bits, we would do the following:

0000 0011 & 0000 0001 (aka 1) = 0000 0001 (the right most bit is set, count it)
0000 0011 >> 1 = 0000 0001
Repeat the process for this new byte
I'm using bytes (8 bit groupings), i.e.  0000 0011, because all single characters in ASCII are represented by 1 byte.

Let's look at what a right bit shift (>>) does one more time to hammer it home:

15 in decimal is 0000 1111 in binary
15 >> 1       is 0000 0111 in binary or 7 in decimal
7  >> 1       is 0000 0011 in binary or 3 in decimal
3  >> 1       is 0000 0001 in binary or 1 in decimal
1  >> 1       is 0000 0000 in binary or 0 in decimal

Code

Ugh, finally some code, amirite??

def hamming_distance(bytes1, bytes2):
  # The strings must be equal length or this will fail.
  assert len(bytes1) == len(bytes2)

  distance = 0
  for zipped_bytes in zip(bytes1, bytes2):
    # XOR a bit from bytes1 with the corresponding bit in bytes2
    x = zipped_bytes[0] ^ zipped_bytes[1]

    set_bits = 0
    while (x > 0):
      # Check if the right most bit is set. If it is then track it.
      set_bits += x & 1;

      # Right shift the bits so we can check the next one in line.
      x >>= 1; 

    # Add the number of set bits for the current chars to the total
    # distance
    distance += set_bits

  return distance


b1 = b'this is a test'
b2 = b'wokka wokka!!!'

assert hamming_distance(b1, b2) == 37

And there we have it...

  1. Zip up the bytes, loop over the zipped groups and XOR them together.
  2. Find how many bits are set in the XOR'd value using & and >>.
  3. Add the number of set bits to the total distance and repeat the process for the next character.
  4. Return the total distance when complete.

The Cryptopals challenge specifically provides the "wokka wokka!!!" example for us to verify against. See the assertion in the code above?

Next up we need to use our newly minted hamming_distance function for...

Getting Key Length

This part entails more code than hamming_distance but I think it's easier to wrap your mind around.

For each guessed key length between 2 and n (in our case 40), find the total normalized/averaged Hamming distance across all chunks. Then return the KEYLENGTH that gets us the best score. In this code, that is stored in the best_keylength var.

I'm just going to jump into the code on this one, see comments for clarification.

def get_keylength(ciphertext):
  lowest = None
  best_keylength = None

  for keylength in range(2, 41):
    to_average = []

    # Define the starting and ending points for the first chunk
    start = 0
    end = start + keylength
    while (1):
      # Grab 2 adjacent chunks of data that are KEYLENGTH long.
      first_chunk = ciphertext[start:end]
      second_chunk = ciphertext[start + keylength:end + keylength]

      # Check if we're at the end of ciphertext. We can ignore the
      # dangling bit.
      if (len(second_chunk) < keylength):
          break

      # Find the distance between these two KEYLENGTH chunks
      distance = hamming_distance(first_chunk, second_chunk)

      """
      "Normalize" the distance. This basically gets it to a decimal
      place that is relative to the total keylength so that it can be
      compared to distances based on other key lengths.
      """
      normalized = distance / keylength

      # We've got a score append it to the list of distances we want
      # the average of.
      to_average.append(normalized)

      # Move on to the next chunk that we'll want to get hamming
      # distances for.
      start = end + keylength
      end = start + keylength

    # Find the average of those distances and then empty out the array
    # for the next iteration.
    average = sum(to_average) / len(to_average)
    to_average = []

    # Check if we've beat the current lowest score. If we have that's
    # more likely the correct key length.
    if lowest == None or average < lowest:
      lowest = average
      best_keylength = keylength

  return best_keylength

OK, we've potentially figured out our key length. The next step is...

Transposing The Ciphertext

Since we know how long our key is and we know that the key repeats, then every KEYLENGTH plaintext character must be encrypted with the same key character. So we need to group our data in this way.

Here's an example: If we were to group the text Hello, how are you? by KEYLENGTH 3, this is what the 3 groups would look like...

Group 1: Hl wry?
Group 2: eoh eo
Group 3: l,oa u

Group 1 was all encrypted by the same char, group 2 was all encrypted by the same char and so on.

Here's how I chose to do this with the ciphertext...

Create a dictionary where its keys are the group positions and each position has an array that gets filled with the bytes for each grouping.

def transpose_chunks_by_keylength(keylength, ciphertext):
  # Create a dictionary for the number of chunks that the data can be
  # broken into.
  chunks = dict.fromkeys(range(keylength))

  i = 0
  for octet in data:
    # If we're at the end of the key start at the beginning again. This
    # is "repeating key" XOR after all.
    if (i == keylength): i = 0

    # If the chunk is null, initialize it to an empty array.
    if (chunks[i] == None): chunks[i] = []

    # Append the current octet to the chunk.
    chunks[i].append(octet)

    i += 1

  return chunks

Success. Next is...

Getting The Key

We've got a key length and we've got chunks that we know are each encrypted by a single character. This is a pretty big deal. At this stage brute force frequency analysis is our best friend.

For each block, loop over the values in those blocks. XOR each one with all 127 ASCII values and convert what you get back to a string. If it's in ETAOIN SHRDLU that's a good sign, increment the score for that char. Tack the best scoring char onto our key and return the full key at the end.

Here's what that looks like:

def get_key(blocks):
  common = 'ETAOIN SHRDLU'
  key = ''

  for i in blocks:
    current_high_score = 0
    current_key_char = ''

    for j in range(127):
      # Create an array of all the XOR'd
      x = [j ^ the_bytes for the_bytes in blocks[i]]
            
      # Convert the array of numbers back into bytes
      b = bytes(x)

      # Convert to a string so we can compare it to the common
      # letters.
      b_str = str(b, 'utf-8')

      # Increase the score for everywhere there is overlap
      score = 0
      for k in b_str.upper():
          if k in common:
              score += 1

      # If this score is better for this char, keep it
      if score > current_high_score:
        current_high_score = score
        current_key_char = chr(j)

    key += current_key_char

  return key
This would be the place where if ETAOIN SHRDLU wasn't good enough (say, because the plaintext isn't English) you would implement a different frequency system. For example a more robust scoring method or frequencies for another language.

Decrypting

Great, we probably know the key, now for the fun part! We can decrypt the text by taking the bytes of the encrypted message and XORing them with the bytes of the key (repeated for the length of the encrypted message).

Since our get_key function is returning a string and not bytes, convert each char to a number using ord so XOR doesn't explode.

def decrypt(message_bytes, key):
  decrypted = b''

  i = 0
  for byte in message_bytes:
    # Go back to the beginning of the key if we've reached it's length.
    # This handles the "repeating" bit of the key.
    if (i == len(key)):
      i = 0

    # Convert the key char to a number so it can be XOR'd
    xor = byte ^ ord(key[i])
    
    # Convert the xor'd value back to bytes... bytes(...) requires an
    # array as an argument, hence the [...]
    decrypted += bytes([xor])

    i += 1

  return decrypted

The Rest

Huzzah! We've got everything we need. Hopefully you've been testing your code along the way but here's what it might look like to run all the steps end to end.

from base64 import b64decode

the_file = open('./6.txt', 'r')
data = the_file.read()

# We know the file is base64 encoded so decode it first, this converts it
# to raw bytes.
decoded = b64decode(data)

# First we need to take a stab at finding the key length.
keylength = get_keylength(decoded)

# Once we have a good key length transpose the chunks
chunks = transpose_chunks_by_keylength(keylength, decoded)

# Get the key from the transposed chunks
key = get_key(chunks)

# Decrypt the ciphertext
decrypted = decrypt(decoded, key)

# Find out what it is!
print(decrypted)

Profit.

Conclusion

I hope you walk away from this grokking what repeating key XOR is and how to break it. But perhaps more importantly, I hope you've learned some basic (but foundational) concepts around programmatic cryptography.

One thing I struggled with throughout was knowing what data types I should be working with at any given time. Text strings? Byte strings? Hex? Base64?? It's crazy confusing. However, with practice and experimentation you'll get a feel for the data and these things will become more intuitive.

Play around with this code. See if you can implement it yourself without reference. Convert the key or the ciphertext bytes to some other data type and see how it plays out. You may be surprised by what you discover.

Resources