barn cover: 201206

A little oversight I had in the development of MungeTLS was that in all my testing, I'd basically hard-coded a certificate for the server to use. Heh, at early points, I had literally hard-coded it in the CPP file as a huge byte array tucked away at the bottom. It later progressed to calling the Windows cert store APIs to find a certificate and grab its encoded representation.

I didn't really touch it again, since it was working pretty well, but now that I've gotten most other crypto stuff working, it's time to make it a little more useful. One important thing I've always had to do in SSL client library testing is handling of different certificate chains. Different bitness of certificate keys, different "usage" OIDs allowed, out of order certificate chains, and so on. As a start, MungeTLS needs to support sending more than a measly one certificate in the Certificate handshake message.

This effort wasn't really that complicated or time consuming. It came down to switching all my PCCERT_CONTEXT variables to PCCERT_CHAIN_CONTEXT. The chain context structure gives you access to an array of PCCERT_CONTEXT structures, which is exactly what I need.

Correspondingly, in an unprecedented show of convenience, there is simply a method to grab a whole cert chain, given a PCCERT_CONTEXT: CertGetCertificateChain. It's beautiful! As far as parameters go, the only strange one that doesn't have a good default value is a PCERT_CHAIN_PARA structure. And of this, the only required parameter is RequestedUsage, which is nearly optional. Here's what I use to initialize it:

CERT_CHAIN_PARA chainPara = {0};

chainPara.cbSize = sizeof(chainPara);

// indicates not to use this member
chainPara.RequestedUsage.dwType = USAGE_MATCH_TYPE_AND;

if (!CertGetCertificateChain(
         NULL,
         pCertContext,
         NULL,
         NULL,
         &chainPara,
         0,
         NULL,
         &pCertChain))
{
    hr = HRESULT_FROM_WIN32(GetLastError());
    goto error;
}

Interacting with the resultant chain structure is also easy. Here's where I get access to the PCCERT_CONTEXT that I was previously being passed directly:

spPubKeyCipherer.reset(new WindowsPublicKeyCipherer());
hr = spPubKeyCipherer->Initialize(
         (*CertChain())->rgpChain[0]->rgpElement[0]->pCertContext);

There's only one small hiccup in this structure: notice that I have to traverse two levels of arrays. I think of a certificate chain as a simple array, where element 0 is the leaf and element size-1 is the self-signed root certificate. But apparently Windows gives us a list of chains. I don't know what this means, exactly. In the simple example I'm using--three certificates in a chain--Windows returns me only one list. I wonder what circumstances produce more than one. I bet it's some weird smartcard thing.

And here's the code where I create a Certificate handshake message with multiple certificates in the chain:

PCCERT_CHAIN_CONTEXT pCertChain = *ConnParams()->CertChain();

// what does it mean to have 2 simple chains? no support for now
assert(pCertChain->cChain == 1);

PCERT_SIMPLE_CHAIN pSimpleChain = pCertChain->rgpChain[0];

for (DWORD i = 0; i < pSimpleChain->cElement; i++)
{
    hr = certificate.AddCertificateFromMemory(
             pSimpleChain->rgpElement[i]->pCertContext->pbCertEncoded,
             pSimpleChain->rgpElement[i]->pCertContext->cbCertEncoded);

    if (hr != S_OK)
    {
        goto error;
    }
}

At first I was manually just mashing all of the encoded representations together in a byte vector and sending those raw over the wire. That's not actually correct according to the RFC, though. Here's the relevant section:

opaque ASN.1Cert<1..2^24-1>;

struct {
    ASN.1Cert certificate_list<0..2^24-1>;
} Certificate;

You have to kind of know the notation in the RFC doc to really grok this, but angle brackets indicate a "vector", which is always preceded by a length field just large enough to represent the range indicated. So by mashing together all the certificates, I was not correctly setting the length of the individual certificates in the blob. An illustration:

Wrong:

Certificate length,
certificate_list[0],
certificate_list[1],
and certificate_list[2]

00 00 0f 01 02 03 04 11 12 13 14 15 21 22 23 24 25 26

Right:

Certificate length,

certificate_list[0].length,
certificate_list[0],

certificate_list[1].length,
certificate_list[1],

certificate_list[2].length,
and certificate_list[2]

00 00 0f 00 00 04 01 02 03 04 00 00 05 11 12 13 14 15 00 00 06 21 22 23 24 25 26

Not so subtle once you look at it side by side, right? In the first case you have no idea how long each individual certificate is. That was probably an overblown, unnecessary example. But we're at the end! This all works. See?

Showing the Windows UI of the certificate chains in IE

A thought on platform agnosticism

It's worth pointing out that this is one area of the code I haven't yet made platform-agnostic. Ideally the caller would pass in a list of X.509 certificates in their byte representation, the leaf certificate's private key, and the leaf certificate's public key. Maybe it's possible for me to parse the public key out of the certificate, but that's going to require understanding the X.509 format... I'm not sure I signed up for that!

Perhaps a better method is to make two initialization possibilities for the Windows public key encryptor object: one that takes the certificate store parameters to look it up in the Windows cert store and one that takes some well known format, like PFX. Still, the goal is for someone to be able to rapidly do testing using the tools they already have, without requiring a lot of weird conversions. This effort needs some more thinking.

At this point in my MungeTLS project, I think I've taken care of basically all the hard stuff I initially had planned. I'll talk about it in a minute, but I've implemented I think all of the cryptographic stuff needed for a simple TLS server. Let's run down the list of features:

TLS 1.0 support
TLS 1.2 support
stream ciphers such as RC4-128
block ciphers such as AES-256
newer hash functions such as SHA-256

At the end of the last post, I had just gotten a TLSCiphertext structure's fragment to decrypt using a stream cipher. That's a pretty good step, but stream ciphers are puny, simple things. Block ciphers are where things get interesting. But first, we hadn't--er--finished the Finished message. Let's get to it.

Finish the Fight

The purpose of the Finished message is to tie off the TLS handshake by making sure that both parties are using the same key material and didn't see any tampering in the middle of the connection. First, the client sends a Finished message to the server, then the server sends a (different) one back.

The contents of the Finished message are simple: a hash of all of the messages received so far during the handshake, and in the order received. This proves both points with regards to key material and tampering. The pair of Finished messages are also the first two messages encrypted under the just-negotiated security parameters.

Here's how the Finished hash is calculated:

struct {
    opaque verify_data[12];
} Finished;

verify_data
    PRF(master_secret, finished_label, MD5(handshake_messages) +
    SHA-1(handshake_messages)) [0..11];

The master secret is known to us. The finished_label is just a static string, "client finished" or "server finished", but the handshake messages are interesting. First of all, we're going to have to store some additional state in the connection, keeping track of all handshake messages seen so far. Secondly, what exactly do we store? The entire message, including record-layer headers? Actually, the RFC is nice enough to explain it properly: only data in the Handshake message is stored--so no headers from the TLSCiphertext part. Oh, and of course the hash omits the Finished message that it is inside, itself.

By the time of reception of a client Finished message, I've typically tracked the following messages:

working on the following handshake messages:
    ClientHello
    ServerHello
    Certificate
    ServerHelloDone
    ClientKeyExchange

Computed Finished hash with label "client finished":
AC D2 32 D5 C8 7F AF 1F 15 C6 12 EE 
Received Finished hash:
AC D2 32 D5 C8 7F AF 1F 15 C6 12 EE

Note that the ChangeCipherSpec message won't be included here, since it is not actually a handshake message. The only bugs or difficulty I encountered in this code were around not tracking all the handshake messages and in the right order.

Having checked the security of the client's Finished message successfully, I also had to create and send one from the server. This one only differs in content from the client's, because it includes the client's Finished message itself.

Time to Hash it out again

Actually, before we could send our own Finished message back to the client and finally put a fork in this handshake, we have to implement record integrity, that is, compute a MAC to attach to the message so that the other party can verify its integrity. As a baby step towards this, I implemented verifying the client Finished message's MAC that I'd just received.

To make it more concrete, let's look again at the stream cipher's structure:

stream-ciphered struct {
    opaque content[TLSCompressed.length];
    opaque MAC[CipherSpec.hash_size];
} GenericStreamCipher;

The MAC is generated as:

   HMAC_hash(MAC_write_secret, seq_num + TLSCompressed.type +
                 TLSCompressed.version + TLSCompressed.length +
                 TLSCompressed.fragment));

where "+" denotes concatenation.

seq_num
   The sequence number for this record.

hash
   The hashing algorithm specified by
   SecurityParameters.mac_algorithm.

This is formulaic as always, but--hey, what's this? seq_num is a new thing. The sequence number... hmm. What is that? Search for "seq_num". No hits. Well, except for the description that very helpfully points out to laymen that "seq_num" means "sequence number". It's strange that they don't really specify it in a nice, clean way. Still, the information is hiding, spread out a bit in the RFC.

There are separate sequence numbers for the "read" state and the "write" state. They are 64 bits long and reset to zero whenever a connection state is made "active." When is that, exactly? It's not super obvious, but with some careful reading, it seems to be immediately after a ChangeCipherSpec message is sent or received. The sequence number is incremented after every subsequent record-layer message (like TLSPlaintext or TLSCiphertext) is sent or received. All of this took some good, old-fashioned shot-in-the-dark experimentation, but I've got it now.

You're doing it wrong -- your encryption, that is

So I was able to correctly verify the MAC on an inbound record. I likewise computed a MAC for an outbound record and attached it. I was feeling pretty good, until the client hit me with an Alert message. I knew it was an alert because the content type was 21, but oddly enough, upon decryption, I couldn't read the payload at all! OpenSSL at least was kind enough to tell me the alert was fatal bad_record_mac. Hey, I thought I had this decryption stuff working!

I stumbled on this for quite a while, doing what I've had to do in the past to debug this crazy crypto stuff: scour my code for errors. I found some, too, but nothing that helped. Finally, something I read in the RFC tickled the back of my brain.

For stream ciphers that do not use a synchronization vector (such as RC4), the stream cipher state from the end of one record is simply used on the subsequent packet.

I feel like I've read something like this elsewhere... How about the docs for CryptDecrypt?

The following extra actions are performed when Final is TRUE: blah blah blah... If the cipher is a stream cipher, the next CryptDecrypt call resets the cipher to its initial state.

Hey, wait, that doesn't sound good! The stream must go on, according to the RFC, right? Of course I was setting the Final parameter to CryptDecrypt to TRUE, because previously I was only encrypting once. I merely flipped this to FALSE and things magically started working. Even my hash computation was correct!

Blocks That Matter

This is not to be confused by the excellent Xbox Live Indie Game of the same name, of course. We're talking about block cipher encryption, such as AES. Let's start of by looking at the structure, as usual.

block-ciphered struct {
    opaque content[TLSCompressed.length];
    opaque MAC[CipherSpec.hash_size];
    uint8 padding[GenericBlockCipher.padding_length];
    uint8 padding_length;
} GenericBlockCipher;

Careful, now; this is TLS 1.0 block ciphering, not TLS 1.2, which has a key difference, which we'll come to later. The interesting part of this is the rules about the padding. Let me outline them here.

The total length of the plaintext, including padding and the padding length field, must be a multiple of the block size. For AES-128, for instance, that is 16 bytes.
The value of every byte in the padding field is equal to the padding length.

So, an example. Let's say we're using AES-128, which has 16 byte blocks. And let's pretend we have 10 bytes of content + MAC.

content (10 bytes):
01 02 03 04 05 06 07 08 09 0a

content (10 bytes) with 5 padding bytes and padding length (1 byte):
01 02 03 04 05 06 07 08 09 0a 05 05 05 05 05 05

total length of content is 10 + 5 + 1 == 16 == block length

One implication of the way this padding works is that without doing a little math, we don't know how long the content is. We have to examine the last byte to see how much padding to expect, then subtract the padding length, padding lenght field length, and MAC length from the total length of the structure. The remainder is the content size. With some careful pointer arithmetic, there's no problem.

Block Ciphers in CryptAPI

So there's a thing that CryptDecrypt doesn't tell you about block ciphers. It plays weird truncation games with the padding. Let's go back to our example from before. Let's say you decrypted something that produced the plaintext we looked at above (the 10 bytes of content, plus padding). After calling CryptDecrypt, you'd actually end up with this:

decrypted content from CryptDecrypt
01 02 03 04 05 06 07 08 09 0a 05

Hey! Where's the rest of my padding?! I don't know why Windows does this, but I just worked around it. Since this is down in a "core" crypto part of the code, and I want to shelter the TLS part of the code from implementation differences, I actually reconstruct the padding and padding length before returning this plaintext to the caller. Luckily I have that padding length byte at the end, so I can do this.

Let's revisit our dear, dear friend, endianness, yet again. We're talking about block ciphers now. What byte order do they use? As a reminder, RSA encryption used host byte order, so I had to reverse it from the network. RSA is kind of a block cipher, right? So other block ciphers should also use host byte order. Yeah, that would make sense.

Too much sense, if you hadn't guessed from my incredibly leading commentary. For whatever reason, both types of symmetric encryption--stream and block--use network byte order. Whatever. It was only two possibilities to try. I just wish the docs were clearer on this.

How about the Final parameter that we looked at before? Actually, each individual block is treated as a "final" block. Again, I just had to experiment; some of this is not intuitive at all. This is the reason this crypto stuff scares me. By design, you have very little to go on, so debugging it feels like groping blindly at possibilities.

Just hook me up to an IV, Windows

There's another gotcha related to block ciphers, which is that they typically take an initialization vector, or an IV. This is yet another reason I tackled the simpler stream ciphers first. The IV is a sort of "seed" for the encryption. I don't think I mentioned it before, but the first IV is generated as part of the whole key block after the master secret is computed. That IV is used to encrypt only the first chunk of ciphertext, and then is thrown away. The last ciphertext block of that chunk is used as the IV for the next chunk of ciphertext and so on.

Notice I use these weird terms, "chunk" and "block". The RFC is fairly precise about it, but it doesn't hurt to get clarity. Remember how in AES-128, a ciphertext "block" was 16 bytes? Well, if you have a 20 byte chunk of plaintext, that gets padded to be 32 bytes of ciphertext, or two blocks. So the last block, or 16 bytes, is used as the next IV, regardless of how big your ciphertext payload just was. By the way, this technique is called Cipher Block Chaining (or CBC).

I need to make sure I set the IV for encryption just before encrypting my next payload. Windows lets me do this, but with oddities. There's actually a whole snippet of sample code in the docs for CryptDecrypt about how to set the IV properly.

What it comes down to is calling CryptSetKeyParam with KP_IV, but there's a trick to it, because this only takes effect after the next Final encryption. To work around this, you can set the IV and duplicate the HCRYPTKEY (using the staggeringly original CryptDuplicateKey), which initializes itself with the pending IV of the original key. It's a bit weird, but not that bad, since they spell out how to do it.

TLS 1.2, why you gotta be so WRONG?

After doing all of the above, I got block ciphers and stream ciphers totally working with both OpenSSL and IE. That's a pretty good deal. Along the way I had always implemented TLS 1.2 after getting TLS 1.0 working, even though OpenSSL doesn't even support it. Most of the implementation I've been talking about for this whole post works as-is for both versions, but there is one minor difference and one major difference worth mentioning.

The minor difference is that the hash used in the Finish message now uses SHA-256 rather than a combo of MD5 and SHA-1. Whoopee. Done.

The major difference is that block ciphers no longer use IVs the way they do in TLS 1.0. Instead of using the last ciphertext block as the IV for the next encryption, each record randomly generates and encrypts the IV for the next record as part of the payload. The first IV is generated just like in TLS 1.0, as part of the key block. This helps thwart attacks that snoop the tail end of the previous payload as a hint into how to decrypt the next payload. It seems like an unlikely attack vector, but any bit counts, right? Defense in depth, as they say.

So here is the TLS 1.2 block cipher structure:

struct {
    opaque IV[SecurityParameters.record_iv_length];
    block-ciphered struct {
        opaque content[TLSCompressed.length];
        opaque MAC[SecurityParameters.mac_length];
        uint8 padding[GenericBlockCipher.padding_length];
        uint8 padding_length;
    };
} GenericBlockCipher;

Look at this critically. Do you see anything strange about it? Here's a hint: look at which part of it is encrypted. Doesn't it strike you as odd that the IV is outside the encrypted portion? It certainly struck me as odd, considering the point of this is to prevent the next IV from being snooped. Oh well. I went ahead and implemented it as it says. After all, the RFC has been specific and accurate so far.

Some time debugging later, and I'm pretty well convinced this is actually a bug. IE, the only TLS 1.2 implementation I tested with, does not understand this structure at all. On the contrary, it is encrypting the IV along with the rest of the payload. The structure should look like this:

// TLS 1.knutaf ha ha

block-ciphered struct {
    opaque IV[SecurityParameters.record_iv_length];
    opaque content[TLSCompressed.length];
    opaque MAC[SecurityParameters.mac_length];
    uint8 padding[GenericBlockCipher.padding_length];
    uint8 padding_length;
} GenericBlockCipher;

When implemented this way, I was able to get IE working properly with TLS 1.2. Who's right? I have no idea. Who's to say? A point of interest: TLS 1.2 is disabled by default in IE, probably for compatibility reasons, but I couldn't hazard a more specific guess.

What's next!

Well, that was fun! I think most of the "hard" stuff is done in this project--by that I mean all the crazy crypto that requires me to wildly swing my hammer, as in wack-a-mole. The next phase of the project is cleanup: I have an embarrassingly large amount of magic numbers strewn about. I need to refactor stuff that got ugly. I have errant members that shouldn't be, and others with the wrong degree of privacy. I haven't actually implemented support for alerts, so I ought to do that.

But I probably won't write another blog post until I'm into the phase beyond that, which is about designing the interface to the application. The what? Don't remember what the point of this whole project was? We'll come back to that, don't worry.

barn cover

Blog Archive

About Me

20120607

A Quick Bit on Cert Chains