I've reached what I'd call basically "v1.0" of MungeTLS, so I'm going through the exercise of making a pass over the code with a fine-toothed comb, cleaning up lots of things, and documenting the heck out of everything, all in preparation for sending it to a few of my friends to code review. I expect to get a lot of scathing comments about my lack of exceptions and usage of gotos.
Anyway, while doing this pass over my code I noticed some strange behavior having to do with encryption and decryption of TLS 1.1 and 1.2 block ciphers. Let's do an overview of how they work, then look at the aberrant behavior I found, then go through the investigation it took to get to the bottom of it. It's been a crazy few days, let me tell you.
Block Ciphers and CBC
In TLS 1.1 and 1.2, the way that block ciphers are used was changed. Oops, let me step back a second. There are two common types of "bulk" (that is, symmetric key) encryption used with TLS: stream ciphers and block ciphers. A stream cipher gives you encrypted data byte-for-byte; RC4 is an example of a stream cipher. A block cipher takes a chunk of data and turns it into a series of fixed-size blocks of encrypted data, potentially with some padding at the end; AES is a very popular example. With AES's 16-byte block size, for instance, a 20-byte payload gets padded out and comes back as 32 bytes of ciphertext.
Block ciphers have a weakness when the same key is used on data over and over: encrypt the same plaintext block twice under the same key and you get the exact same ciphertext, which leaks patterns to anyone watching. To combat this, a technique called an "initialization vector" (IV) is used to sort of seed the encryption. It's actually fairly simple in implementation: just XOR the IV into the plaintext right before encrypting. Moreover, we're using our block ciphers in "cipher-block-chaining" (CBC) mode, where each ciphertext block is used as the IV for the next block.
There are some particularly interesting things to glean from this diagram. Firstly, for encryption, you're effectively transmitting a series of IVs + ciphertext blocks, since each ciphertext block is implicitly the IV for the next one. That's a pretty weird concept, and it also implies that it's okay to transmit an IV in the clear.
For decryption, the most interesting thing to see is that if you start decrypting with the wrong IV, only the first block comes out incorrect. That's also pretty wacky. I'm not a crypto guy, so it doesn't really make sense to me why this is accepted and okay. Apparently there are other block cipher "modes" besides CBC that do a better job with some of these concerns, but CBC is probably the most popular one. Go figure.
Another important thing to note is that the IV size is always the same as the block size, since each block is itself also an IV.
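If you don't have the diagram in front of you, the chaining boils down to two lines of notation. E() and D() are the block cipher's encrypt and decrypt with the bulk key, ^ is XOR, P1..Pn are the plaintext blocks, and C1..Cn are the ciphertext blocks:

Encryption:  C1 = E(P1 ^ IV),  C2 = E(P2 ^ C1),  ...,  Cn = E(Pn ^ C(n-1))
Decryption:  P1 = D(C1) ^ IV,  P2 = D(C2) ^ C1,  ...,  Pn = D(Cn) ^ C(n-1)

Every block past the first one only looks at the ciphertext block right before it, which is where all three of the observations above come from.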
I ended up writing a bunch of sample code to figure out these weird gotchas before looking this stuff up and finding the super useful diagram above, which confirmed these oddities.
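In case you want to poke at this yourself, here's a cut-down sketch of the kind of experiment I mean. It's not the MungeTLS code, and the "block cipher" in it is just a made-up invertible byte transform rather than AES, but the CBC chaining around it is the real thing, so it demonstrates the wrong-IV behavior: only the first block comes out garbled.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

typedef std::vector<uint8_t> Bytes;
static const size_t BLOCK = 8; // toy block size; AES would be 16

// Toy invertible per-block transform standing in for a real block cipher.
// This is NOT encryption -- it's only here so we can watch the CBC chaining.
static void BlockTransform(uint8_t* block, uint8_t key, bool encrypt)
{
    for (size_t i = 0; i < BLOCK; i++)
    {
        block[i] = encrypt ? static_cast<uint8_t>((block[i] ^ key) + 13)
                           : static_cast<uint8_t>(static_cast<uint8_t>(block[i] - 13) ^ key);
    }
}

// CBC encryption: XOR each plaintext block with the previous ciphertext block
// (or the IV, for the first block), then run it through the block cipher.
static Bytes CbcEncrypt(Bytes plaintext, const Bytes& iv, uint8_t key)
{
    Bytes prev = iv;
    for (size_t i = 0; i < plaintext.size(); i += BLOCK)
    {
        for (size_t j = 0; j < BLOCK; j++) { plaintext[i + j] ^= prev[j]; }
        BlockTransform(&plaintext[i], key, true);
        prev.assign(plaintext.begin() + i, plaintext.begin() + i + BLOCK);
    }
    return plaintext; // now ciphertext
}

// CBC decryption: run each block through the cipher, then XOR with the
// previous *ciphertext* block (or the IV, for the first block).
static Bytes CbcDecrypt(const Bytes& ciphertext, const Bytes& iv, uint8_t key)
{
    Bytes out = ciphertext;
    Bytes prev = iv;
    for (size_t i = 0; i < ciphertext.size(); i += BLOCK)
    {
        BlockTransform(&out[i], key, false);
        for (size_t j = 0; j < BLOCK; j++) { out[i + j] ^= prev[j]; }
        prev.assign(ciphertext.begin() + i, ciphertext.begin() + i + BLOCK);
    }
    return out;
}

int main()
{
    Bytes plaintext(4 * BLOCK, 'A');
    Bytes goodIV(BLOCK, 0x11), wrongIV(BLOCK, 0x22);
    uint8_t key = 0x5c;

    Bytes ciphertext = CbcEncrypt(plaintext, goodIV, key);
    Bytes decrypted = CbcDecrypt(ciphertext, wrongIV, key); // wrong IV on purpose

    // Only block 0 is garbled; every later block uses the previous ciphertext
    // block as its IV, so it decrypts fine even though we started out wrong.
    for (size_t i = 0; i < decrypted.size(); i += BLOCK)
    {
        bool ok = std::equal(decrypted.begin() + i, decrypted.begin() + i + BLOCK,
                             plaintext.begin() + i);
        printf("block %zu: %s\n", i / BLOCK, ok ? "ok" : "garbled");
    }
    return 0;
}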
Now let's take a step back chronologically, to a time before I had read and figured out all that stuff about IVs, to a time when I noticed some weirdness in my code. The plot arc looks like this: we're going to look at what the RFC seems to be saying we should do. Then we'll look at what I should have implemented. Then we'll look at what I actually implemented. We'll see how I and the RFC are wrong. Finally, we'll have some unsettling conclusions.
The idea of "IVNext", insidiously suggested by the RFC
First, let's take a look at what TLS 1.1 tells us in the RFC. Actually, they don't really tell us much about how the IV should work. It's like they expect me to be some kind of security expert to implement this thing. God, what were they thinking?
// TLS 1.1
block-ciphered struct {
    opaque IV[CipherSpec.block_length];
    opaque content[TLSCompressed.length];
    opaque MAC[CipherSpec.hash_size];
    uint8 padding[GenericBlockCipher.padding_length];
    uint8 padding_length;
} GenericBlockCipher;
The IV field here is encrypted, which seems to mean the IV contained within this record actually has to be the "next IV" -- the one to be used for decrypting the next record. The very first IV is computed from the master secret, identically on both the client and server side. At least, in the absence of any explicit guidance, that's what a layman like me would infer from the RFC.
Let's translate that into some heavy pseudocode.
CurrentIV = IV0 (which you get from the key material generation)
function SendMessage(payload)
{
    message = CreateMessage()
    message.content = payload

    NextIV = GenerateNextIV() // IVs should be some random value
    message.IV = NextIV // send this to be used on the next ciphertext

    bytes = EncryptMessage_TLS_1_1(message, CurrentIV) // current IV set prior
    CurrentIV = NextIV

    SendBytes(bytes)
}
Great, that's what the RFC suggests I should implement. The problem--or rather, the thing that made me notice the oddities in the first place--is that I didn't implement this correctly, and my code still worked anyway. That's basically a programmer's worst nightmare.
What happens when we don't implement IVNext properly?
Spoiler alert: nothing. Here's the pseudocode for what I actually implemented.
CurrentIV = IV0 (which you get from the key material generation)
function SendMessage(payload)
{
    message = CreateMessage()
    message.content = payload

    message.IV = CurrentIV // oops! recipient will use the wrong IV!
    bytes = EncryptMessage_TLS_1_1(message, CurrentIV)
    CurrentIV = GenerateNextIV()

    SendBytes(bytes)
}
So I was sending the same IV within the message that I was using to encrypt the message in the first place. The recipient is going to use the wrong IV to decrypt the message, and-and-and everything will go crazy... right?!
That's what I had thought at first, before I read up on how IVs actually work. Remember the explanation from earlier: the IV really only affects the first block of decryption. The recipient will decrypt the whole ciphertext and get back an incorrect value for the IV field I sent -- but it doesn't care, because that value isn't even used! Each ciphertext block serves as the IV for the decryption of the next block, and that part all happens correctly.
I don't know if I'm explaining that right. Fundamentally, it would never make sense to send the IV encrypted as the first block of ciphertext. To a security expert this is probably such a patently obvious design flaw that they didn't even bother to issue errata about the bug in the GenericBlockCipher struct in TLS 1.1. It's actually fixed in TLS 1.2 -- which, in a previous post, I had incorrectly called out as the one that looked buggy.
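For comparison, here's roughly what the TLS 1.2 version of the struct looks like (see RFC 5246 for the exact text). The IV sits outside the block-ciphered portion, which makes the intent unambiguous:

// TLS 1.2
struct {
    opaque IV[SecurityParameters.record_iv_length];
    block-ciphered struct {
        opaque content[TLSCompressed.length];
        opaque MAC[SecurityParameters.mac_length];
        uint8 padding[GenericBlockCipher.padding_length];
        uint8 padding_length;
    };
} GenericBlockCipher;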
Fixing the IV code
Well, the first thing we're going to do when fixing it is stop calling anything "IV Next". It's really the IV for this record, not the next one. Here is the correct code.
function SendMessage(payload)
{
    message = CreateMessage()
    message.content = payload
    // also add padding, yeah

    message.IV = GenerateIV() // a fresh random IV, used only for this record
    bytes = EncryptMessage_TLS_1_1(message.content, message.IV)

    // IV goes out in cleartext; bytes are encrypted
    SendBytes(message.IV + bytes)
}
Wow, that's much simpler. It's weird that the IV that's generated during key material calculation isn't even used. I think it's used for other cipher types that actually do need it. Anyway, this actually works.
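So for a TLS 1.1/1.2 block-cipher record, the bytes on the wire end up looking roughly like this (the record header is the usual type/version/length):

record = header || IV || CBC-Encrypt(content || MAC || padding || padding_length)

The IV rides along in plaintext, right in front of the ciphertext it applies to.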
Some final thoughts
I already explained above why it doesn't matter whether we encrypt the entire payload (IV field and all) or just start from the content: each ciphertext block depends on the previous block as its IV. Whatever block-sized piece of data immediately precedes the payload acts as the IV for the encrypted payload, no matter how it got there and no matter whether it's plaintext or ciphertext. While I understand why it works, it still bugs me that it works both ways.
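To spell that out one more time in the same loose notation: there are really two ways a receiver could treat the first block-sized chunk of the record, and they land in the same place.

Reading 1 (what the TLS 1.1 struct implies): CBC-decrypt the whole record with whatever IV you have on hand, then throw away the first decrypted block (the "IV" field).
Reading 2 (what implementations actually do): treat the first block on the wire as the IV, in the clear, and CBC-decrypt only the blocks after it.

Either way, every content block decrypts as D(that block) ^ (the block right before it on the wire). The first block only ever feeds into the next one as XOR input, exactly as it appears on the wire, so the content comes out identical whether that first block was sent encrypted or in plaintext.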
And I still don't get why it's even safe to send IVs in the clear like this. In TLS 1.0, the IV for a record was the last block of the previous ciphertext, as though all ciphertext was just chained together. It was changed in TLS 1.1 to be included adjacent to the payload it's encrypting, but it's still in the clear. I don't get how it's fundamentally any better.
Any security experts around wanna bust out the math and show me what's up?