The technique of hiding information in public data is called steganography. The Base64 encoding uses 0-padding when encoding data. It is possible to hide information in this padding, as it is disregarded upon decoding. For efficiently hiding larger amounts multiple strings need to be encoded as one Base64-encoded string can contain 4, 2 or 0 bits of secret text. This article explains the technique, provides a python code for hiding and retrieving the information and shows performance information about the method.
Steganography or cryptography?
Nowadays, it is usually common sense to encrypt your data. That is, scrambling it up, sending it over the internet and unscrambling it on the other side by the right person. Obviously, this has been great practice and a huge step forward towards privacy. But this art which is called cryptography (of which I am very fond by the way) eclipsed its dreamy less pragmatic sibling, steganography. And I am going to attempt to revive it, at least for the beauty of it if not for its practical use.
Look at the text below:
If you are not new to computers, you have definitely already come across the base64 encoding and it is highly likely this is the first thing that comes to mind when you look at this apparent randomness above.
First reflex, use your favourite base64 decoder and retrieve the beginning of 1979’s The Thin Ice song by Pink Floyd.
And that’s it … or is it?
If you find it suspicious that this text was encoded-word by word and not as a whole, then you might look a bit more into it and with a little presence of mind (and an automated tool) you will find that there is a hidden message behind these encoded words.
Keep reading to find out more.
About Base64 and its padding
Base64 is very popular because anything that is binary can be converted to it. If you take all the letters in the English alphabet (uppercase and lowercase) and all the numbers then you would get exactly 62 characters. Add to them the two special characters “/” and “+” and voilà you have a 64-character set, a power of 2, which means that any combination of 6 bits (because 2^6 = 64) can be printed as a character in base64. So if you have an image of 5MB, you can easily split it into 6-bit blocks and then print the base64 equivalent of that particular combination.
To keep things simple, in this article we will only convert ASCII strings to Base64 because ASCII just like Base64 is a visual representation of a set of bits, only it is using 8 bits instead of 6. And it is this difference in bit size that will be of interest. Note that the real ASCII encoding uses 7 bits only, but computers use one full byte (8 bits) to store such characters, which is what we will use as well.
Let’s give an example to set things clear:
ASCII: V = 01010110
Base64: V = 010101
As you can see, the same visual representation (the letter V) is different for the two encodings.
So how does switching from one encoding to the other work?
Let’s take the “ABC” string in ASCII and want to convert it to Base64. We will:
- Convert each ASCII string to its 8-bit representation
- Concatenation of the three 8-bit strings
- Regroup the 24 bits in groups of 6-bit strings
- Replace each 6-bit group with its equivalent in base64
ASCII: 010000010100001001000011 = ABC
Base64: 010000010100001001000011 = QUJD
Notice that the binary representation is the same, but the grouping is different. Thus, the base64 encoding is different as well.
But this is a fortunate example I took because a 3-character ASCII string translates directly to a 4-character base64 string since 24 (the number of bits) is both a multiple of 6 and of 8. Lucky right? And also, this holds for strings of sizes 6, 9 and so on. That would be all the ASCII strings of a size which is a multiple of 3. Let’s call that 3n. Who said basic math was useless?
So what happens when the ASCII string size is not a multiple of 3 ? What happens for 3n+1 and 3n+2?
Let’s check both cases.
ASCII: 01000001 = A
Base64: 01000001 = Q?
ASCII: 0100000101000010 = AB
Base64: 0100000101000010 = QU?
Since the last bits cannot form a whole 6-bit string, the Base64 encoding adds padding to these strings until they reach the appropriate length. And for representation purposes, each “00” padding is represented by an “=” sign, to instruct the decoder how many bits should be discarded from the end of the string.
As a result, we have the following translation, with two equals for 3n+1 and one equal for 3n+2.
ASCII: 01000001 = A
Base64: 010000010000 = QQ==
ASCII: 0100000101000010 = AB
Base64: 010000010100001000 = QUI=
These are the three possible scenarios, so a Base64-encoded string should end with either one, two or no equals.
Ok, fine, but how is this going to be useful?
Well, look again at that padding! Base64 decoders proceed by discarding the last 2 or 4 bits of the string when necessary. So it doesn’t really matter what you pad with since these bits are discarded.
Or does it?
If you are only doing base64 encoding/decoding you just pad with 0s as described in the RFC, but if you are doing steganography then, you could take advantage of these bits being skipped and replace them with your own chosen bits.
Let’s replace the last 4 bits in the 3n+1 encoding above with 0101. We would have the following translation:
ASCII: 01000001 = A
Base64: 010000010101 = QV==
And if you decode QV== unsurprisingly you land on the A ASCII character! Look:
Congratulations you have just hidden the bits 0101 in the base64 encoding of the letter A 😊
The same applies for 3n+2 only there you can only hide only 2 bits of information since the padding is of size 2.
Great, so now we have a way of hiding bits in base64, but hiding only 2 or 4 bits may not be very useful. To encode more bits we would need to have more base64 strings.
This is why in the example at the beginning of the article instead of encoding the whole text as one base64 string I decided to encode each word individually. This would allow me to hide information in each ASCII word with a size which is not of type 3n.
In order not to perform all these computations manually, I wrote a script which does this for me.
There is an encoding mode, which takes in the carrier ASCII text, that is the text in which the message will be hidden, and the secret, which is the message to hide.
And there is a decode mode which retrieves the carrier text as well as the secret from the encode list.
The carrier text will always be able to carry at least as many bits as there are in the secret message, which means that there will always be padding to the secret message. Padding when hiding in the padding, oh the irony. That is why my script encodes the secret in a certain format. Specifically, the secret begins with a 16-bit counter, which contains the number of secret characters that follow. That way, the decoder knows when to stop converting the text back.
Stats for nerds
If you are a sucker for statistics and numbers like I am than you might find this section interesting.
Because the script is done and we can analyze virtually any carrier text, let’s have a look at how efficient this steganography technique is.
On average, we should have an even distribution between the 3n, 3n+1 and 3n+2 words. So for very large texts a third of the text can hide 4 bits, another third 2 bits, and the last third can’t hide anything. On average, that is 2 bits per word. Since a full character is written on 7 bits, which means that on average it takes 3,5 words of carrier text to encode one secret character. By multiplying this number with the average length of the words used in the carrier one should obtain the number of visible characters required to encode one full secret character.
It would be interesting to test this hypothesis over a very large carrier text and see if it holds.
Three iconic books were downloaded in txt format from Guttenberg.com. For each book the average number of characters per word was computed, then multiplied by 3,5 in order to compute the guess as explained above.
Then, the script was run to compute the maximum length of the secret to hide inside the carrier. Finally, the total number of visible characters divided by this number was computed as the real ratio of secret characters that can be hidden.
|Book name||Words||Characters||Chars/word||Expected visible chars/secret char||Computed maximum secret length||Real visible chars/secret char|
|Alice in wonderland||167807||27432||6,117199||21,4102||8224||20,40455|
That gives us an error of 1,72% error. Now that’s a number I can live with!
And I will remember in the future that if I wish to encode secrets with this script I should count about 20 visible characters to hide one secret character, as a finger rule!
Of course, a bit more than that would be required when encoding over small carrier texts because one needs to account for the 16-bit header which contains the length of the text. This header is covered by eight words on average, so the carrier should probably be greater than this. But anyway, don’t worry about the math behind it, the script will tell you how many characters you can encode anyway.
Base64 padding steganography: conclusion
Although not the most efficient technique, the base64 padding steganography is a great way to hide messages in plain sight. And remember, not everything is exactly what it looks like at first glance 😉
From Base64 – CyberChef (gchq.github.io)
Guttenberg – Alice in Wonderland