July 20, 2019 · cryptography

Basic CS Concepts to Master For Cryptography (and More)

OK so this is, to some, a boring topic. I like this sort of thing because it's foundational to how computers work. It's the underpinnings of how I can bang on this keyboard to make stuff happen on the screen, how your Tesla autopilot (you have a Tesla right?) knows to not Grand Theft Auto all those pedestrians, and how Facebook is ruining the world with its stupid platform... I digress.

If you've got a computer science degree this is basic CS 101, so maybe it's familiar territory, but I'd wager if you sat down to workshop these concepts you'd find the wheels require a bit of grease. And, if you've been programming for years and have never thought about any of it, that's ok too! I'm glad you're here. It's time to enlighten yourself.

So what's on the docket? Number bases, binary, hexadecimal, strings and base64.

Disclaimer: There won't be much actual cryptography here but, trust me, what you're about to read applies.

Number Bases

Wherein I try to explain how to count to 10.

What is a number base?

From good ol' Wikipedia...

In mathematics, a base or radix is the number of different digits or combination of digits and letters that a system of counting uses to represent numbers.

"Base"ically (heh) it defines a system of counting. The base tells us what each "place" represents and how many symbols are used to represent the value at each place. The base you're most familiar with is base 10, aka decimal. Let's check out the anatomy of a decimal number.

Say, 4,519.

Here we have a 9 in the 1s place, a 1 in the 10s place, a 5 in the 100s place and a 4 in the 1000s place. In decimal, each place can be represented by ONLY the 10 symbols 0-9.

Consider it a rule that only a single symbol (like "8" or "a", but not "13" or "3$j") can go in a place. This is obvious in base 10 but becomes more important to consider for other bases, like base 16 or base 64.

So, how do we know that a place is "the 10s place" or "the 1000s place?" If we assume each place has an index, starting at 0 and incrementing by 1 from right to left, then we can use the base to determine the place values with the following equation.

place value = base^index

In our 4,519 example, 9 is at index 0, 1 is at index 1, 5 is at index 2 and 4 is at index 3. Since we know we are working in base 10, and the 9 in our number is at index 0, we can apply our equation to get the place value where 9 sits: 10⁰ = 1, i.e. 9 is in the 1's place. Check out these calculations for the first 4 places in decimal:

10⁰ = 1's place
10¹ = 10's place
10² = 100's place
10³ = 1000's place
...

Once we understand this, it begins to make sense what the number 4,519 actually means in base 10.

4     5     1     9  
-     -     -     - 
1000  100   10    1

It's 4 1000s, 5 100s, 1 10 and 9 1s. Or, put another way...

(4 × 10³) + (5 × 10²) + (1 × 10¹) + (9 × 10⁰) = 4,519
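If you'd like to see that arithmetic in code, here's a minimal Python sketch (the variable names are mine, just for illustration) that splits a number into its digits and multiplies each by its place value:

# Expand 4519 into digit × place-value terms.
number = 4519
digits = [int(d) for d in str(number)]    # [4, 5, 1, 9]
# Index 0 is the rightmost digit, so reverse before enumerating.
terms = [d * 10 ** index for index, d in enumerate(reversed(digits))]
print(terms[::-1])    # [4000, 500, 10, 9]
print(sum(terms))     # 4519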

Why does this matter?

Seriously, who cares? We intuitively understand how to count in decimal; do we really need to know the fundamentals of something so... fundamental? As it turns out, this same logic applies to all base counting systems, and even though base 10 makes sense to us humans (see: hands), computers tend to prefer other systems, like base 2 (aka binary) and base 16 (aka hexadecimal).

Binary

At the most fundamental level, binary is all a computer really "knows." We humans have just stacked a whole bunch of abstractions on top of this system (like operating systems and compilers) to allow us to interact with computers in a way that makes more sense to a person. At some point though everything you "do" on a computer winds up as 1s and 0s.

Binary is a base 2 numbering system. Recall that place value = base^index, so our places from right to left look like...

2⁰ = 1
2¹ = 2
2² = 4
2³ = 8
2⁴ = 16
...

In other words, each place doubles, starting with 1. Also recall that the base defines how many symbols represent the value at a place. So, in base 2 we use 2 symbols, and those are: 1 and 0. Let's say we wanted to represent our decimal number 4,519 in binary. What would that look like?

1000110100111

Why? Well, when we look at each digit above its respective place it may start to make a little more sense:

1    0    0    0    1    1    0    1    0    0    1    1    1
-    -    -    -    -    -    -    -    -    -    -    -    -  
4096 2048 1024 512  256  128  64   32   16   8    4    2    1

And finally, here's that math spelled out:

(1 × 2¹²) + (0 × 2¹¹) + (0 × 2¹⁰) + (0 × 2⁹) + (1 × 2⁸) + (1 × 2⁷) + (0 × 2⁶) + (1 × 2⁵) + (0 × 2⁴) + (0 × 2³) + (1 × 2²) + (1 × 2¹) + (1 × 2⁰) = 4519

Why does this particular number start at the 2¹² place and not 2¹³ or some other value? Well, 2¹² = 4,096, which is the largest place value that fits into 4,519 (i.e. the largest power of 2 less than or equal to it). 2¹³ = 8,192, which is too large.
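If Python is handy (assuming that's your language of choice; any language with base conversions will do), its built-ins can confirm all of this:

# Decimal to binary string, and back again.
print(bin(4519))                  # 0b1000110100111
print(int('1000110100111', 2))    # 4519
print(2 ** 12, 2 ** 13)           # 4096 8192, which is why the leading 1 sits at index 12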

Bits, Nibbles and Bytes

Now that we've got some background it's important to understand some tech terminology related to the binary number system. For computer cryptography this is important because the terms "bits" and "bytes" are used all the time.

A bit is a single digit in binary. So in 1000110100111 each 1 or 0 is a single bit. We can say 1000110100111 is 13 bits wide.

A byte is a grouping of bits, usually 8 of them. ASCII characters (which you'll learn about later in this post) are represented by 7 bits (so almost a byte). Unicode characters can take anywhere from 1 to 4 bytes, depending on the encoding.

8 bits is more accurately referred to as an octet. Some systems call a "byte" something other than a grouping of 8 bits, but for our purposes (most purposes, really) assuming 8-bit bytes will work just fine.

A nibble 🐟 is a grouping of 4 bits, aka half a byte. This is less common to hear in the wild but it makes for nice trivia so there you go.
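To make the terminology concrete, here's a quick Python sketch pulling a byte apart into its two nibbles using standard bit shifting and masking:

byte = 0b10110101                  # one byte: 8 bits
high_nibble = (byte >> 4) & 0xF    # top 4 bits: 0b1011
low_nibble = byte & 0xF            # bottom 4 bits: 0b0101
print(bin(high_nibble), bin(low_nibble))    # 0b1011 0b101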

Hexadecimal

Also known as base 16.

What is it?

Now that we are masters of number bases, and knowing this is a base 16 system, we should be able to figure it out by hand. The only special knowledge here is that, since each place can be represented by 16 symbols, our "digits" are 0-9 and A-F (capitalized or not) in that order. That's 16 digits for our base 16 system.

Considering our decimal number 4,519 again, how could we represent this in hex? Well, it would take up the following 4 places, whose values from right to left are...

16⁰ = 1
16¹ = 16
16² = 256
16³ = 4096

Knowing this we can calculate that 4,519 in decimal = 11A7 in hexadecimal. Or...

(1 × 16³) + (1 × 16²) + (A × 16¹) + (7 × 16⁰) = 4096 + 256 + 160 + 7 = 4,519 (remember, the hex digit A stands for decimal 10)
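Python's built-ins will confirm the arithmetic here too:

print(hex(4519))          # 0x11a7
print(int('11A7', 16))    # 4519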

Why is it important?

There are a few reasons you see hex all over computery stuff.

  1. It's more readable than binary but still easily translatable to it (and recall that binary is a computer's native language).
  2. It packs more information into a single digit than binary.
  3. Since 16 is a power of 2 (it's 2⁴), a hex digit can ALWAYS be represented by exactly 4 bits (a nibble).
  4. Hex digits can be represented by a relatively small number of familiar symbols in an order we expect (0-9 and A-F).

Basically, between being nice for computers and decent for humans, it is the perfect storm of a number base. Hence, we see it all over the place in the computer science realm, and that includes programmatic cryptography.
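To see reason 3 from the list above in action, here's a short Python loop expanding each hex digit of 11A7 into exactly one nibble:

for digit in '11A7':
    print(digit, format(int(digit, 16), '04b'))
# 1 0001
# 1 0001
# A 1010
# 7 0111
# Glue the nibbles together and, leading zeros aside, you get 1000110100111.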

Strings

So, we know that computers only understand binary and everything more sophisticated than that is a layer of abstraction built upon a binary foundation. How can we possibly represent something like an alphabet in a system that only understands 1s and 0s?

Encoding and standardization.

Encoding can mean subtly different things depending on the context but the gist of it is taking some information (say, 0s and 1s) and transforming it, in a logical way, into something else (say, some text like "He hates these cans!").

Standardization is basically the "logical way" I mentioned above. If someone comes up with a specific way of turning 1s and 0s into, say, text, and then enough people use it, it becomes a "standard." An early standardized encoding that's still in use today is ASCII (American Standard Code for Information Interchange).

ASCII is an encoding scheme, geared towards English, that allows 128 "characters" to be represented individually by 7-bit integers, and we've already learned how integers can be represented in binary.

Why 7 bits and not 8, you may ask? The 8th bit wasn't necessary to cover 128 characters, and there was a time when resources were limited enough that that sort of thing mattered.

ASCII can represent all the letters of the English alphabet (both lowercase and upper case), numbers from 0-9, punctuation and more. Here's an ASCII table:


Dec  Hex  Char      Dec  Hex  Char        Dec  Hex  Char      Dec  Hex  Char
0    00   NUL       32   20   (space)     64   40   @         96   60   `
1    01   SOH       33   21   !           65   41   A         97   61   a
2    02   STX       34   22   "           66   42   B         98   62   b
3    03   ETX       35   23   #           67   43   C         99   63   c
4    04   EOT       36   24   $           68   44   D         100  64   d
5    05   ENQ       37   25   %           69   45   E         101  65   e
6    06   ACK       38   26   &           70   46   F         102  66   f
7    07   BEL       39   27   '           71   47   G         103  67   g
8    08   BS        40   28   (           72   48   H         104  68   h
9    09   TAB       41   29   )           73   49   I         105  69   i
10   0A   LF        42   2A   *           74   4A   J         106  6A   j
11   0B   VT        43   2B   +           75   4B   K         107  6B   k
12   0C   FF        44   2C   ,           76   4C   L         108  6C   l
13   0D   CR        45   2D   -           77   4D   M         109  6D   m
14   0E   SO        46   2E   .           78   4E   N         110  6E   n
15   0F   SI        47   2F   /           79   4F   O         111  6F   o
16   10   DLE       48   30   0           80   50   P         112  70   p
17   11   DC1       49   31   1           81   51   Q         113  71   q
18   12   DC2       50   32   2           82   52   R         114  72   r
19   13   DC3       51   33   3           83   53   S         115  73   s
20   14   DC4       52   34   4           84   54   T         116  74   t
21   15   NAK       53   35   5           85   55   U         117  75   u
22   16   SYN       54   36   6           86   56   V         118  76   v
23   17   ETB       55   37   7           87   57   W         119  77   w
24   18   CAN       56   38   8           88   58   X         120  78   x
25   19   EM        57   39   9           89   59   Y         121  79   y
26   1A   SUB       58   3A   :           90   5A   Z         122  7A   z
27   1B   ESC       59   3B   ;           91   5B   [         123  7B   {
28   1C   FS        60   3C   <           92   5C   \         124  7C   |
29   1D   GS        61   3D   =           93   5D   ]         125  7D   }
30   1E   RS        62   3E   >           94   5E   ^         126  7E   ~
31   1F   US        63   3F   ?           95   5F   _         127  7F   DEL

Some ASCII "characters" (most of the first 32) are now obsolete because they were only required by teletype machines, which are no longer in use today.

Though ASCII is still popular, its size is woefully inadequate for the world's text. Unicode is one solution: a much larger character set whose encodings (like UTF-8) extend ASCII to represent everything from languages like Russian and Arabic to emoji.
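You can watch these encodings at work in Python: ord gives a character's number, chr goes the other way, and .encode shows the actual bytes.

print(ord('A'))               # 65, which is 0x41
print(chr(66))                # B
print('A'.encode('ascii'))    # b'A': a single byte, 0x41
print('é'.encode('utf-8'))    # b'\xc3\xa9': 2 bytes in UTF-8
print('🙂'.encode('utf-8'))   # b'\xf0\x9f\x99\x82': 4 bytes in UTF-8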

Why is this important?

First off, we're finally getting a glimpse of how a computer represents things that us meat bags can understand. However, more to the point of this article, we can catch a glimpse of how this relates to cryptography. It turns out that most computerized encryption schemes operate on the raw binary data, not the encoded text.

Say we want to encrypt the string "Meet me at midnight" with the key "ICE" using a simple repeating key XOR scheme. Each character of both the plaintext and the key would have to get converted to something the computer understands and XOR'd to get our ciphertext. We've got the knowledge now to understand what's happening there at the bit level, i.e. if we XOR the two characters A and B, what's really happening is the bits that represent the characters A and B are being XOR'd together.

A in hexadecimal is 41 and B in hexadecimal is 42, or 01000001 and 01000010 in binary, respectively.

So, 01000001 ^ 01000010 = 00000011.

00000011 is 03 in hexadecimal, which as an ASCII character turns out to be an obsolete one (END OF TEXT). But we can leave it as hexadecimal, and if we combine all the returned hex values we get an encrypted piece of data that we could actually print on a piece of paper.
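Here's a minimal sketch of that repeating key XOR scheme in Python (the function name is mine, and this is a toy cipher for learning, not something to protect real secrets with):

# XOR each plaintext byte with the key byte at the same position,
# cycling through the key as many times as needed.
from itertools import cycle

def repeating_key_xor(plaintext: bytes, key: bytes) -> bytes:
    return bytes(p ^ k for p, k in zip(plaintext, cycle(key)))

ciphertext = repeating_key_xor(b'Meet me at midnight', b'ICE')
print(ciphertext.hex())    # printable hex, two hex digits per byte
# XOR is its own inverse, so applying the same key again decrypts:
print(repeating_key_xor(ciphertext, b'ICE'))    # b'Meet me at midnight'

Notice the hex output doubles the length of the raw data: every byte becomes two characters. Or we could use...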

Base 64

Another number base?? Yes! As it turns out, base 64 is a good choice for converting raw binary data into string data that most machines will reliably understand.

Raw binary data is problematic because god knows how that information is going to be interpreted. Try printing out a huge blob of binary data to your terminal and you'll get all sorts of garbage, from large amounts of whitespace to nonsensical characters to audible beeps. Sending this data over a network can be equally problematic. It needs to be in a format that we know will make it across the wire with a small footprint and no data loss, and be something that whatever machine we're communicating with can understand. The safest bet is to represent our binary information as text.

Base 64 requires 64 symbols to represent its system of counting. Check this chart out:

Value Char Value Char Value Char Value Char
0 A 16 Q 32 g 48 w
1 B 17 R 33 h 49 x
2 C 18 S 34 i 50 y
3 D 19 T 35 j 51 z
4 E 20 U 36 k 52 0
5 F 21 V 37 l 53 1
6 G 22 W 38 m 54 2
7 H 23 X 39 n 55 3
8 I 24 Y 40 o 56 4
9 J 25 Z 41 p 57 5
10 K 26 a 42 q 58 6
11 L 27 b 43 r 59 7
12 M 28 c 44 s 60 8
13 N 29 d 45 t 61 9
14 O 30 e 46 u 62 +
15 P 31 f 47 v 63 /

Some implementations use different characters for values 62 and 63, but + and / are the most common. Notice that all of these characters fit into the small but mighty ASCII encoding!
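In fact, you can build the chart's whole alphabet in one line of Python (using the common + and / variant):

import string

# 26 uppercase + 26 lowercase + 10 digits + '+' and '/' = 64 symbols
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'
print(len(alphabet))    # 64
print(alphabet[26])     # a: value 26, just like the chart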

Why not use hex, decimal, binary, or even some arbitrary system like base 76, you might ask? These could all fit fine into ASCII. The problem with hex, binary and decimal is that your data would balloon: base 64 packs much more data into a single symbol (6 bits per character, versus 4 for hex and 1 for binary). The problem with something like base 76 or base 80 (and also decimal) is that computers tend to work best with number systems whose bases are powers of 2, so these other systems become much more difficult to work with programmatically.

Why not base 128? At face value it seems like base 128 checks all the boxes: it's a power of 2, it packs even more data into a single character and, at first glance, it fits into ASCII. However, a large chunk of ASCII is made up of control characters, some of them obsolete. They aren't really text, which means that, functionally, base 128 doesn't fit into ASCII. Not gonna work.
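You can verify that shortage of usable symbols quickly; Python's string.printable counts only 100 printable ASCII characters, and that even includes whitespace like tabs and newlines:

import string

# 100 printable characters, whitespace included: well short of 128.
print(len(string.printable))    # 100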

How does this apply to cryptography?

Since base 64 can be represented by ASCII, it means that you can represent a key, plaintext and ciphertext as both something that a human can read and something that can be easily stored in a database. And while base 64 inflates data by about a third compared to the raw binary (every 3 bytes become 4 characters), that's far better than hex, which doubles it, so a large amount of binary data still fits into a relatively small amount of text that you could send in an email. Base 64 is awesome.
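Python's standard library handles the conversion; here's a minimal sketch (the input bytes are arbitrary):

import base64

raw = bytes([0x03, 0xFF, 0x10, 0x9C])      # some raw binary data
encoded = base64.b64encode(raw)
print(encoded)                              # b'A/8QnA=='
print(base64.b64decode(encoded) == raw)     # True

The trailing = signs are padding, marking that the final group of 3 input bytes wasn't full.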

Conclusion

This is a lot of info, I realize. For me, once I wrapped my mind around number bases the rest sort of clicked into place. These concepts are fundamental to understanding both computerized cryptography and how computers work in general, and I'd encourage you to play around with them in your programming language of choice. See what happens when you print binary data, see if you can convert ASCII characters to hex, or decimal numbers to base 64. Whatever floats your boat. Most importantly, have fun!
