July 20, 2019

Basic CS Concepts to Master For Cryptography (and More)

OK so this is, to some, a boring topic. I like this sort of thing because it's foundational to how computers work. It's the underpinnings of how I can bang on this keyboard to make stuff happen on the screen, how your Tesla autopilot (you have a Tesla right?) knows to not Grand Theft Auto all those pedestrians, and how Facebook is ruining the world with its stupid platform... I digress.

If you've got a computer science degree this is basic CS 101 so maybe it's familiar territory, but I'd wager if you sat down to workshop these concepts you'll find the wheels require of a bit of grease. And, if you've been programming for years and have never thought about any of it, that's ok too! I'm glad you're here. It's time to enlighten yourself.

So what's on the docket? Number bases, binary, hexadecimal, strings and base64.

Disclaimer: There won't be much actual cryptography here but, trust me, what you're about to read applies.

Number Bases

Wherein I try to explain how to count to 10.

What is a number base?

From good ol' Wikipedia...

In mathematics, a base or radix is the number of different digits or combination of digits and letters that a system of counting uses to represent numbers.

"Base"ically (heh) it defines a system of counting. The base tells us what each "place" represents and how many symbols are used to represent the value at each place. The base you're most familiar with is base 10, aka decimal. Let's check out the anatomy of a decimal number.

Say, 4,519.

Here we have a 9 in the 1s place, a 1 in the 10s place, a 5 in the 100s place and a 4 in the 1000s place. In decimal, each place can be represented by ONLY the 10 symbols 0-9.

Consider it a rule that only a single symbol (like "8" or "a" but not "13" or "3$j"), can go in a place. This is obvious in base 10 but becomes more important to consider for other bases, like base 16 or base 64.

So, how do we know that a place is "the 10s place" or "the 1000s place?" If we assume each place has an index, starting at 0 incremented by 1 from right to left then we can use the base to determine the place values. With the following equation.

place = base^index

In our 4,519 example, 9 is at index 0, 1 is at index 1, 5 is at index 2 and 4 is at index 3. So, since we know we are working in base 10 and the 9 in our number is at index 0. We can apply our equation to get the place value where 9 sits like so 10⁰ = 1 i.e. 9 is at the 1's place. Check out these calculations for the first 4 places in decimal:

10⁰ = 1's place
10¹ = 10's place
10² = 100's place
10³ = 1000's place
...

Once we understand this then it begins to make sense what the number 4,519 actually means in base 10.

4     5     1     9  
-     -     -     - 
1000  100   10    1

It's 4 1000s, 5 100s, 1 10 and 9 1s. Or, put another way...

(4 × 10³) + (5 × 10²) + (1 × 10¹) + (9 × 10⁰) = 4,519

Why does this matter?

Seriously, who cares, we intuitively understand how to count in decimal, do we really need to know the fundamentals of something so... fundamental? As it turns out this same logic applies to all base counting systems and even though base 10 makes sense to us humans (see: hands), computers tend to prefer other systems, like base 2 (aka binary) and base 16 (aka hexadecimal).

Binary

At the most fundamental level, binary is all a computer really "knows." We humans have just stacked a whole bunch of abstractions on top of this system (like operating systems and compilers) to allow us to interact with computers in a way that makes more sense to a person. At some point though everything you "do" on a computer winds up as 1s and 0s.

Binary is a base 2 numbering system. Recall that place = base^index so what this means is that our places from right to left look like...

2⁰ = 1
2¹ = 2
2² = 4
2³ = 8
2⁴ = 16
...

In other words each place doubles, starting with 1. Also recall that the base defines how many symbols represent the value at a place. So, in base 2 we use 2 symbols and those are: 1 and 0. Let's say we wanted to represent our decimal number 4,519 in binary what would that look like?

1000110100111

Why? Well, when we look at each digit above it's respective place it may start to make a little more sense:

1    0    0    0    1    1    0    1    0    0    1    1    1
-    -    -    -    -    -    -    -    -    -    -    -    -  
4096 2048 1024 512  256  128  64   32   16   8    4    2    1

And finally, here's that math spelled out:

(1 × 2¹²) + (0 × 2¹¹) + (0 × 2¹⁰) + (0 × 2⁹) + (1 × 2⁸) + (1 × 2⁷) + (0 × 2⁶) + (1 × 2⁵) + (0 × 2⁴) + (0 × 2³) + (1 × 2²) + (1 × 2¹) + (1 × 2⁰) = 4519

Why does this particular number start at the 2¹² place and not 2¹³ or some other value? Well, 2¹² = 4,096 which is the largest place value that 4,519 is divisible by. 2¹³ = 8,192 which is too large.

Bits, Nibbles and Bytes

Now that we've got some background it's important to understand some tech terminology related to the binary number system. For computer cryptography this is important because the terms "bits" and "bytes" are used all the time.

A bit is a single digit in binary. So in 1000110100111 each 1 or 0 is a single bit. We can say 1000110100111 is 13 bits wide.

A byte is a grouping of bits, usually 8 of them. ASCII characters (which you'll learn about later in this post) are represented by 7 bits (so almost a byte). Unicode characters are represented by 4 bytes.

8 bits is more accurately referred to as an octet. There are some systems that call a "byte" something other than a grouping of 8 bits but for our purposes (most purposes really) assuming 8 bit bytes will work just fine.

A nibble 🐟 is a grouping of 4 bits, aka half a byte. This is less common to hear in the wild but it makes for nice trivia so there you go.

Hexadecimal

Also known as base 16.

What is it?

Now that we are masters of number bases and, knowing this is a base 16 system, we should be able to figure it out by hand. The only special knowledge here is that, since each place can be represented by 16 symbols, our "digits" are 0-9 and A-F (either capitalized or not) in that order. That's 16 digits for our base 16 system.

Considering our decimal number 4,519 again, how could we represent this in hex? Well it would take up the following 4 places from left to right...

16⁰ = 1
16¹ = 16
16² = 256
16³ = 4096

Knowing this we can calculate that 4,519 in decimal = 11A7 in hexadecimal. Or...

(1 × 16³) + (1 × 16²) + (a × 16¹) + (7 × 16⁰)

Why is it important?

There are a few reasons you see hex all over computery stuff.

It's more readable than binary but still easily translatable to it (and recall that binary is a computer's native language).
It packs more information into a single digit than binary.
Since 16 is a multiple of 2, a hex digit can ALWAYS be represented by 4 bits (or a nibble).
Hex digits can be represented by a relatively small number of familiar symbols in an order we expect (0-9 and A-F).

Basically, between being nice for computers and decent for humans it is the perfect storm of a number base. Hence, we see it all over the place in the computer science realm and that includes programmatic cryptography.

Strings

So, we know that computers only understand binary and everything more sophisticated than that is a layer of abstraction built upon a binary foundation. How can we possibly represent something like an alphabet in a system that only understands 1s and 0s?

Encoding and standardization.

Encoding can mean subtly different things depending on the context but the gist of it is taking some information (say, 0s and 1s) and transforming it, in a logical way, into something else (say, some text like "He hates these cans!").

Standardization is basically the "logical way" I mentioned above. If someone comes up with a specific way of turning 1s and 0s into, say, text and then enough people use it it becomes a "standard." An early standardized encoding that's still in use today is ASCII (American Standard Code for Information Interchange).

ASCII is an encoding scheme, geared towards English, that allows 128 "characters" to be represented individually by 7 bit integers and we've already learned how integers can be represented in binary.

Why 7 bits and not 8 you may ask? The 8th bit wasn't necessary and there was a time when resources were limited enough that that sort of thing mattered.

ASCII can represent all the letters of the English alphabet (both lowercase and upper case), numbers from 0-9, punctuation and more. Here's an ASCII table:

Dec	Hex	Char
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15	0 1 2 3 4 5 6 7 8 9 A B C D E F	NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI

Dec	Hex	Char
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31	10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F	DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US

Dec	Hex	Char
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47	20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F	(space) ! " # $ % & ' ( ) * + , - . /

Dec	Hex	Char
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63	30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F	0 1 2 3 4 5 6 7 8 9 : ; < = > ?

Dec	Hex	Char
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79	40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F	@ A B C D E F G H I J K L M N O

Dec	Hex	Char
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95	50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F	P Q R S T U V W X Y Z [ \ ] ^ _

Dec	Hex	Char
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111	60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F	` a b c d e f g h i j k l m n o

Dec	Hex	Char
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127	70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F	p q r s t u v w x y z { \| } ~

Some ASCII "characters" (most of the first 30) are now obsolete because they were only required by teletype machines which no longer use today.

Though ASCII is still popular, its size is woefully inadequate. Unicode is one solution. Unicode is basically an umbrella of encodings extending ASCII to provide enough bits to represent everything from languages like Russian and Arabic to emoji.

Why is this important?

First off we're finally getting a glimpse of how a computer represents things that us meat bags can understand. However, more to the point of this article, we can catch a glimpse of how this relates to cryptography. It turns out that most computerized encryption schemes operate on raw binary data not the encoded data.

Say we want to encrypt the string "Meet me at midnight" with the key "ICE" using a simple repeating key XOR scheme. Each character of both the plaintext and the key would have to get converted to something the computer understands and XOR'd to get our ciphertext. We've got the knowledge now to understand what's happening there at the bit level. i.e. If we XOR the two characters A and B what's really happening is the bits that represent the characters A and B are being XOR'd together.

A in hexadecimal is 41 and B in hexadecimal is 42, or 01000001 and 01000010 in binary, respectively.

So, 01000001 ^ 01000010 = 00000011.

00000011 = 03 in hexadecimal, which when ASCII encoded turns out to be an obsolete character (END OF TEXT) but we can leave it as hexadecimal and if we combine all returned hex values we get an encrypted piece of data that we could actually print on a piece of paper. Or we could use...

Base 64

Another number base?? Yes! As it turns out base 64 is good choice for converting raw binary data into string data that most machines will reliably understand.

Raw binary data is problematic because god knows how that information is going to be interpreted. Try printing out a huge blob of binary data to your terminal and you'll get all sorts of garbage from large amounts of whitespace to nonsensical characters to audible beeps. Sending this data over a network can be equally problematic. It needs to be in a format that we know will make it across the wire with a small footprint, no data loss and be something that whatever machine we're communicating with can understand. The safest bet is to represent our binary information it as text.

Base 64 requires 64 symbols to represent its system of counting. Check this chart out:


Value	Char	Value	Char	Value	Char	Value	Char
0	`A`		16	`Q`		32	`g`	48	`w`
1	`B`		17	`R`		33	`h`	49	`x`
2	`C`		18	`S`		34	`i`	50	`y`
3	`D`		19	`T`		35	`j`	51	`z`
4	`E`		20	`U`		36	`k`	52	`0`
5	`F`		21	`V`		37	`l`	53	`1`
6	`G`		22	`W`		38	`m`	54	`2`
7	`H`		23	`X`		39	`n`	55	`3`
8	`I`		24	`Y`		40	`o`	56	`4`
9	`J`		25	`Z`		41	`p`	57	`5`
10	`K`		26	`a`		42	`q`	58	`6`
11	`L`		27	`b`		43	`r`	59	`7`
12	`M`		28	`c`		44	`s`	60	`8`
13	`N`		29	`d`		45	`t`	61	`9`
14	`O`		30	`e`		46	`u`	62	`+`
15	`P`		31	`f`		47	`v`	63	`/`

Some implementations use different characters for values 62 and 63 but + and / are the most common. Notice that all of these characters fit into the small but mighty ASCII encoding!

Why not use hex, decimal, binary even some arbitrary system like base 76 you might ask? These could all fit fine into ASCII, the problem for hex, binary and decimal is that your data would balloon. Base 64 packs much more data into a single symbol than these other systems. The problem with something like base 76 or base 80 (and also decimal) is that since computers tend to work best with number systems whose bases are powers of 2 these other systems become much more difficult to work with programmatically.

Why not base 128? At face value it seems like base 128 checks all the boxes, it's a power of 2, packs even more data into a single character, and, at first glance, fits into ASCII. However, a large chunk of ASCII is made up of control characters, some of them obsolete. So, really they aren't text and this means that, functionally, base 128 doesn't fit into ASCII. Not gonna work.

How does this apply to cryptography?

Since base 64 can be represented by ASCII it means that you can represent a key, plaintext and ciphertext as both something that a human can read and something that can be easily stored in a database. You can also pack so much data into a single digit that you can effectively compress a large amount of binary data into a relatively small amount of text that you could send in an email. Base 64 is awesome.

Conclusion

This is a lot of info, I realize. For me, once I wrapped my mind around number bases the rest sort of clicked into place. These concepts are fundamental to understanding both computerized cryptography and how computers work in general and I'd encourage you to play around with these concepts in your programming language of choice. See what happens when you print binary data, see if you can convert ASCII characters to hex or decimal into base 64. Whatever floats your boat. Most importantly, have fun!

Carter Bancroft

Now

Blog

Good Books

Subscribe

Music

Home