Shubham Mishra
Each party has a secret key \(K\) and and algorithm (or function) \(f\), such that, give a message \(m\), we can generate a ciphertext \(c = f(m, K)\) easily, but it is difficult to know \(m\) or \(K\), given just the ciphertext \(c\).
Then, even through an insecure channel, we can transmit \(c\). No eavesdropper will be able to decipher the underlying message.
How do I guarantee that my function \(f\) does not leak information about \(m\) and \(k\)?
Cryptography: How do I make these cryptographic algorithms?
Cryptanalysis: How do I break these cryptographic algorithms?
Cryptology = Cryptography + Cryptanalysis
Attacker possesses a string of ciphertext
Attacker possesses a string of plaintext and the corresponding ciphertext.
Attacker can choose plaintext and obtain the corresponding ciphertexts.
Chosen ciphertext > Chosen plaintext > Known plaintext > Ciphertext only
[a-z]
can be mapped to [0-25]
. We will use them interchangeably.For example, if my key is 'c'
or 2
, the string hellocryptography
encrypts to jgnnqetarvqitcrja
.
One way to break the cipher will be to try out every possible key. There are 26 in total. So it is doable.
However, it depends on whether you can get the meaning of the underlying message when you are decrypting. And this technique doesn’t scale that well with the size of the alphabet.
Similar distribution exists for bigrams and trigrams as well.
Assuming the underlying message is written in common English:
One weakness of the shift cipher class was that a particular letter of the alphabet always mapped to a single letter. Such ciphers with one-to-one mappings are called Monoalphabetic cipher.
A notch higher in security are the polyalphabetic ciphers. We are going to discuss one of them.
The key, instead of being a single letter (or number), is a word (a tuple of numbers) in this case.
We take the plaintext and then repeat our keyword such that it becomes as long as our plaintext. Then we simply add letters of the same index together to get the ciphertext.
In fact, for years, Vigenere was thought to be unbreakable!
This is a heuristic approach that gives some probable lengths of the key (let’s call it \(m\)).
Note that, same pieces of plaintext separated by a distance of a multiple of the key length, will map to the same ciphertext.
Hence, we look for repeating patterns in the ciphertext and note down the distances between them (we’ll call these \(\delta_i\)).
We find the gcd of these distances. We have \(m | \delta_i \implies m | gcd(\delta_i)\).
How big a pattern you should use is entirely based on trial-and-error. Usually, the smaller the pattern size, the more chance of getting a false positive.
For a piece of text written in English language, \(I_c\) comes out to be around 0.065.
For a random piece of text, assuming uniform distribution of letters, \(I_c \approx\) 0.038.
So there is a big jump!
Let us denote the ciphertext as \(y = y_1 y_2 ... y_n\).
For any integer \(m\), let us split the ciphertext into \(m\) strings, by picking elements \(m\) distance apart:
\(Y_1 = y_1 y_{m + 1} y_{2m + 1} ...\)
\(Y_2 = y_2 y_{m + 2} y_{2m + 2} ...\)
\(...\)
\(Y_m = y_m y_{2m} y_{3m} ...\)
If indeed \(m\) is the key length, then notice that each of \(Y_i\) is an instance of Shift cipher.
Hence there letter statistics is the same as that of English language.
Hence, in that case, \(I_c(Y_i) \approx 0.065\)
But if \(m\) is not the key length, we will not see such statistics. In practice, the \(I_c\) remains around \(0.04\).
Thus we can get a confirmation on the key lengths output by Kasiski test.
Our next and final task is to find the key \(K\).
Using mutual index of coincidence, we will not be able to find the key exactly, but the relative difference between the letters of the key will be known.
For example, we will get the information like the second character of the key is 5 letters ahead of the first and so on.
Mutual Index of coincidence between 2 strings is defined as the probability that 2 random characters picked from each of the strings is same.
Suppose the strings are \(x = x_1 x_2 x_3 ... x_n\) and \(y = y_1 y_2 ... y_n'\) and the corresponding letter frequencies are \(f_i\) and \(f'_i\) then:
\(MI_c(x, y) = \frac{\sum_{i = 0}^{25} f_i f'_i}{n n'} = \sum p_i p'_i\)
If both \(x\) and \(y\) have the same distribution of letters, mutual index essentially becomes the index of coincidence (well, approximately!).
So, if that is the case, then in case of English language-like strings, we will get the mutual index of coincidence as 0.065.
Remember the \(Y_i\)’s we created in the Index of coincidence test.
We will now calculate the \(MI_c(1, j)\) for every \(j \neq 1\).
If \(MI_c(1, j) \approx 0.065\), we can be sure that \(Y_1\) and \(Y_j\) come from the same distribution.
Which can only happen if they have the same amount of shifts. That is, the key letters in position \(1\) and \(j\) are the same.
If the \(MI_c\) is significantly different, then possibly \(Y_1\) and \(Y_j\) are not having the same letter.
To find the relative difference between them, we keep shifting \(Y_1\) by \(1, 2, ..., 25\).
Let’s say that for shift by \(p\), \(MI_c\) became close to \(0.065\).
This would mean that there is a high chance that the \(j^{th}\) letter of the key is \(p\) ahead of the \(1^{st}\) letter.
We will run this process for all \(Y_j\), thereby calculating the relative differences among the letters of the key with respect to the \(1^{st}\) letter.
The only job now is to determine what the \(1^{st}\) letter is.
This can be easily brute-forced. We can try all 26 combinations.
This is a statistical attack. So there is not guarantee that the revealed key is 100% correct.
It requires a substantial amount of ciphertext to become accurate.
“Cryptography: Theory and Practice” by Douglas R. Stinson
Prof. Debdeep Mukhopadhyay’s lecture on this topic in Cryptography and Network Security course (NPTEL)
This slide deck is available on my website:
grapheo12.github.io/yt-slides/vigenere.html
Source code: github.com/grapheo12/yt-slides/blob/master/vigenere.md