Part 3: Breaking Repeating Key XOR Programmatically
This is part three in a three part series on repeating key XOR encryption inspired by Cryptopals challenges five and six of set one. In part one I broke down how to understand and implement repeating key XOR. I'd recommend you start there. In part two I described the theory behind breaking repeating key XOR.
Part three (this part) describes how to write up the code behind part two. This will be geared specifically toward Cryptopals challenge six but the concepts apply generally.
As with part one and part two, to grok what follows you shouldn't need much. Basic encryption knowledge and some Python chops will help but if you don't have those, read on anyway, Google is your friend.
Disclaimer: All code is written in Python 3.
The Process
First, let's recall the general outline of cracking repeating key XOR.
- Guess key lengths
- Find the Hamming distance between
KEYLENGTH
chunks of text. The lowest overall distance is likely the key length. - Transpose the original ciphertext into
KEYLENGTH
chunks grouped by key position. - Use frequency analysis with the help of
ETAOIN SHRDLU
to find the key byte for each transposed chunk and combine the key bytes sequentially to get the key. - Decrypt the ciphertext with the key.
Getting Started
In the Cryptopals challenge we're given a file and told that it is base64 encoded. Your ciphertext may not always be like this. For example, it could be an unencoded string of text. The principles remain the same but in this case we'll start by decoding it to raw binary data.
Here's how:
from base64 import b64decode
the_file = open('./6.txt', 'r')
data = the_file.read()
decoded = b64decode(data)
print(decoded)
Now that we've got our decoded binary data we can start building out the steps to decrypt it. The first thing we should nail down is...
Hamming Distance
If you recall, finding the KEYLENGTH
that produces the shortest Hamming distance between two adjacent KEYLENGTH
chunks of ciphertext is probably the true length of the key. The trick to finding Hamming distance is to use XOR (^
in Python).
What does XOR actually do?
XORing two things returns the number represented by all the differing bits between those two things. For example:
97 in bits is 0110 0001
98 in bits is 0110 0010
97 ^ 98 is 0000 0011 (aka the integer 3)
So, we see that 97 ^ 98
returns a value (3
) that has 2 set bits which means the Hamming distance between 97 and 98 is 2.
In Python, binary operators (like XOR) work with numbers (like36 ^ 47
). If you want to XOR characters the easiest thing to do is useord
to convert them to numbers, like so:ord('a') ^ ord('b')
.
OK, how do we count the number of set bits? Bit manipulation.
Binary & and Right Shifting
We check if the right most bit in a byte is set (aka 1
) by using binary AND (&
) and then chopping the right most bit off using right bit shifting (>>
). Chopping it off allows us to easily check if the next bit in line is set.
In our example 97 ^ 98
= 3
, which is 0000 0011
in bits, we would do the following:
0000 0011 & 0000 0001 (aka 1) = 0000 0001 (the right most bit is set, count it)
0000 0011 >> 1 = 0000 0001
Repeat the process for this new byte
I'm using bytes (8 bit groupings), i.e. 0000 0011
, because all single characters in ASCII are represented by 1 byte.
Let's look at what a right bit shift (>>
) does one more time to hammer it home:
15 in decimal is 0000 1111 in binary
15 >> 1 is 0000 0111 in binary or 7 in decimal
7 >> 1 is 0000 0011 in binary or 3 in decimal
3 >> 1 is 0000 0001 in binary or 1 in decimal
1 >> 1 is 0000 0000 in binary or 0 in decimal
Code
Ugh, finally some code, amirite??
def hamming_distance(bytes1, bytes2):
# The strings must be equal length or this will fail.
assert len(bytes1) == len(bytes2)
distance = 0
for zipped_bytes in zip(bytes1, bytes2):
# XOR a bit from bytes1 with the corresponding bit in bytes2
x = zipped_bytes[0] ^ zipped_bytes[1]
set_bits = 0
while (x > 0):
# Check if the right most bit is set. If it is then track it.
set_bits += x & 1;
# Right shift the bits so we can check the next one in line.
x >>= 1;
# Add the number of set bits for the current chars to the total
# distance
distance += set_bits
return distance
b1 = b'this is a test'
b2 = b'wokka wokka!!!'
assert hamming_distance(b1, b2) == 37
And there we have it...
- Zip up the bytes, loop over the zipped groups and XOR them together.
- Find how many bits are set in the XOR'd value using
&
and>>
. - Add the number of set bits to the total distance and repeat the process for the next character.
- Return the total distance when complete.
The Cryptopals challenge specifically provides the "wokka wokka!!!" example for us to verify against. See the assertion in the code above?
Next up we need to use our newly minted hamming_distance
function for...
Getting Key Length
This part entails more code than hamming_distance
but I think it's easier to wrap your mind around.
For each guessed key length between 2 and n (in our case 40), find the total normalized/averaged Hamming distance across all chunks. Then return the KEYLENGTH
that gets us the best score. In this code, that is stored in the best_keylength
var.
I'm just going to jump into the code on this one, see comments for clarification.
def get_keylength(ciphertext):
lowest = None
best_keylength = None
for keylength in range(2, 41):
to_average = []
# Define the starting and ending points for the first chunk
start = 0
end = start + keylength
while (1):
# Grab 2 adjacent chunks of data that are KEYLENGTH long.
first_chunk = ciphertext[start:end]
second_chunk = ciphertext[start + keylength:end + keylength]
# Check if we're at the end of ciphertext. We can ignore the
# dangling bit.
if (len(second_chunk) < keylength):
break
# Find the distance between these two KEYLENGTH chunks
distance = hamming_distance(first_chunk, second_chunk)
"""
"Normalize" the distance. This basically gets it to a decimal
place that is relative to the total keylength so that it can be
compared to distances based on other key lengths.
"""
normalized = distance / keylength
# We've got a score append it to the list of distances we want
# the average of.
to_average.append(normalized)
# Move on to the next chunk that we'll want to get hamming
# distances for.
start = end + keylength
end = start + keylength
# Find the average of those distances and then empty out the array
# for the next iteration.
average = sum(to_average) / len(to_average)
to_average = []
# Check if we've beat the current lowest score. If we have that's
# more likely the correct key length.
if lowest == None or average < lowest:
lowest = average
best_keylength = keylength
return best_keylength
OK, we've potentially figured out our key length. The next step is...
Transposing The Ciphertext
Since we know how long our key is and we know that the key repeats, then every KEYLENGTH
plaintext character must be encrypted with the same key character. So we need to group our data in this way.
Here's an example: If we were to group the text Hello, how are you?
by KEYLENGTH
3, this is what the 3 groups would look like...
Group 1: Hl wry?
Group 2: eoh eo
Group 3: l,oa u
Group 1 was all encrypted by the same char, group 2 was all encrypted by the same char and so on.
Here's how I chose to do this with the ciphertext...
Create a dictionary where its keys are the group positions and each position has an array that gets filled with the bytes for each grouping.
def transpose_chunks_by_keylength(keylength, ciphertext):
# Create a dictionary for the number of chunks that the data can be
# broken into.
chunks = dict.fromkeys(range(keylength))
i = 0
for octet in data:
# If we're at the end of the key start at the beginning again. This
# is "repeating key" XOR after all.
if (i == keylength): i = 0
# If the chunk is null, initialize it to an empty array.
if (chunks[i] == None): chunks[i] = []
# Append the current octet to the chunk.
chunks[i].append(octet)
i += 1
return chunks
Success. Next is...
Getting The Key
We've got a key length and we've got chunks that we know are each encrypted by a single character. This is a pretty big deal. At this stage brute force frequency analysis is our best friend.
For each block, loop over the values in those blocks. XOR each one with all 127 ASCII values and convert what you get back to a string. If it's in ETAOIN SHRDLU
that's a good sign, increment the score for that char. Tack the best scoring char onto our key and return the full key at the end.
Here's what that looks like:
def get_key(blocks):
common = 'ETAOIN SHRDLU'
key = ''
for i in blocks:
current_high_score = 0
current_key_char = ''
for j in range(127):
# Create an array of all the XOR'd
x = [j ^ the_bytes for the_bytes in blocks[i]]
# Convert the array of numbers back into bytes
b = bytes(x)
# Convert to a string so we can compare it to the common
# letters.
b_str = str(b, 'utf-8')
# Increase the score for everywhere there is overlap
score = 0
for k in b_str.upper():
if k in common:
score += 1
# If this score is better for this char, keep it
if score > current_high_score:
current_high_score = score
current_key_char = chr(j)
key += current_key_char
return key
This would be the place where if ETAOIN SHRDLU
wasn't good enough (say, because the plaintext isn't English) you would implement a different frequency system. For example a more robust scoring method or frequencies for another language.
Decrypting
Great, we probably know the key, now for the fun part! We can decrypt the text by taking the bytes of the encrypted message and XORing them with the bytes of the key (repeated for the length of the encrypted message).
Since our get_key
function is returning a string and not bytes, convert each char to a number using ord
so XOR doesn't explode.
def decrypt(message_bytes, key):
decrypted = b''
i = 0
for byte in message_bytes:
# Go back to the beginning of the key if we've reached it's length.
# This handles the "repeating" bit of the key.
if (i == len(key)):
i = 0
# Convert the key char to a number so it can be XOR'd
xor = byte ^ ord(key[i])
# Convert the xor'd value back to bytes... bytes(...) requires an
# array as an argument, hence the [...]
decrypted += bytes([xor])
i += 1
return decrypted
The Rest
Huzzah! We've got everything we need. Hopefully you've been testing your code along the way but here's what it might look like to run all the steps end to end.
from base64 import b64decode
the_file = open('./6.txt', 'r')
data = the_file.read()
# We know the file is base64 encoded so decode it first, this converts it
# to raw bytes.
decoded = b64decode(data)
# First we need to take a stab at finding the key length.
keylength = get_keylength(decoded)
# Once we have a good key length transpose the chunks
chunks = transpose_chunks_by_keylength(keylength, decoded)
# Get the key from the transposed chunks
key = get_key(chunks)
# Decrypt the ciphertext
decrypted = decrypt(decoded, key)
# Find out what it is!
print(decrypted)
Profit.
Conclusion
I hope you walk away from this grokking what repeating key XOR is and how to break it. But perhaps more importantly, I hope you've learned some basic (but foundational) concepts around programmatic cryptography.
One thing I struggled with throughout was knowing what data types I should be working with at any given time. Text strings? Byte strings? Hex? Base64?? It's crazy confusing. However, with practice and experimentation you'll get a feel for the data and these things will become more intuitive.
Play around with this code. See if you can implement it yourself without reference. Convert the key or the ciphertext bytes to some other data type and see how it plays out. You may be surprised by what you discover.