This blog post reports on some recent research we’ve been doing at Wickr together with cryptographers at NYU and IOHK. We’ve taken a closer look at the current proposal for the Messaging Layer Security (MLS) protocol – an upcoming open standard under development by the IETF’s eponymous working group. I’ll talk about a weakness we’ve identified in the current design. I’ll also describe the fix we came up with to address the issue (currently under consideration for adoption by the MLS workgroup).

The post assumes some passing familiarity with MLS and its goals. In particular the Forward Security and Post Compromise Security properties it aims for. So, if you like to brush up on that check out my previous blog post introducing MLS.

### 1. Updates & The MLS Key Schedule.

Put simply, the issue with the current MLS proposal we’ve been looking at in this research project concerns its PCS security; namely how effectively/quickly it heals from past compromise.

At a high level, a group
of *n* users in an MLS group chat session share a distributed session
state. The state includes *2n* (asymmetric) decryption keys and each user
knows a particular subset of *log(n)* of these keys. This structure allows
for very efficiently encrypting data to arbitrary subsets of the group. One way
this is used is for “healing”. User’s take turns performing a, so called, “update”
operation. This involves choosing fresh random values for their *log(n)*
keys and distributing each new value to the other holders of that key.

A bit more precisely,
this is done by encrypting each new decryption key to the subset of all users
that know the corresponding old value *except* for the person actually
doing the update. In effect, this means that an adversary that managed to learn
Alice’s state (including all her decryption keys) will not be able to decrypt
any of the new value’s she sends out when she performs her next update. But
everyone else who does know the old keys will be able to use one of their *other*
keys to decrypt the new value distributed by Alice. This mechanism forms the core
of the PCS guarantee provided by MLS.

The mechanism also provides some FS as the old keys are overwritten whenever they are updated to a new value. In particular, after such an update, a state compromise won’t reveal anything about their old values so older ciphertexts encrypted to them remain secure.

**Epochs.** We call the period between any two such
updates an “epoch”. In other words, an epoch is characterized (amongst other
things) by the fixed values of the *2n* keypairs in the session’s distributed
sate. Each epoch state also includes a secret value known to all current group members
called the “epoch secret”. This is used to deterministically derive a whole
host of other (symmetric) keys, nonces and other values. For example, from this
epoch secret, users can derive a different symmetric key / nonce pair for each text
message sent by each user during the epoch. This deterministic key schedule, rooted
at the epoch secret, allows parties to encrypt and send application messages
during the epoch without the need for any further synchronization between them.
Moreover (just like the symmetric ratchets in a double-ratchet protocol) the
key schedule allows receivers to always *immediately* decrypt incoming
messages for the epoch regardless of the order in which they were received and while
also maintaining FS for any previously delivered messages.

Of course, as a side effect of these features, the epoch secret is immensely valuable
to an attacker; knowing it means knowing all decryption keys for that epoch!
Thus, MLS defines an additional mechanism making it harder to derive epoch
secrets. An updated by a user results in a packet containing a set of ciphertexts
being sent out to all other group members. The protocol guarantees that all
group members except for the sender can decrypt at least one of those
ciphertexts. Moreover, decrypting any one of those ciphertexts reveals a secret
value that permits the receiver to derive the so called “update secret” for the
epoch. (The sender trivially knows the update secret as they are the one doing
the encrypting in the first place.) Now, to make harder for the adversary to
obtain epoch secrets, MLS defines epoch secrets to depend not just on that
epoch’s update secret, but also (indirectly) on the previous epoch’s epoch
secret. Specifically, from each epoch secret users deterministically derive an “init
secret” for that epoch. Once the next epoch is started (i.e. upon the next update)
the new epoch secret is the result of an HKDF call that includes both the new update
secret and the previous epoch’s init secret as input.

Summarizing from the attacker’s perspective, this means that to compromise all messages sent during a target epoch it’s enough to learn that corresponding epoch secret. Moreover, one way to learn that secret is to learn the target epoch’s update secret as well as the previous epoch’s init secret.

### 2. PCFS: The Problem with MLS.

Suppose, now that Alice’s state has leaked and that she has just performed an update initiating new epoch j. (For simplicity, we’re assuming here that only Alice was compromised. If that’s not the case then the rest of this discussion applies to the last of the corrupted parties to perform an update.)

Beginning from when Alice’s
state was leak and running all the way up to when Alice’s performs her update, the
attacker can decrypt essentially anything Alice can meaning there can be no
expectation of privacy during this time period. Moreover, Alice’s leaked state
must have included the current init secret for that epoch (as she herself needed
it to compute the next epoch secret). So, it’s fair to assume the attacker too
was able to compute the init secrets all the way up to epoch j-1 (just as Alice
can). Now, we said that the update packet starting epoch j was built by Alice in
such a way that none of the decryption keys she new pre-update can be used to
process the update packet. That means, the adversary does *not* have the means
to derive the new update secret and so cannot compute the new epoch secret
either. Thus, at this point we would like to be able to claim that security (e.g.
privacy) for all messages sent during epoch j (and later) have been restored;
at least until the next compromise.

Recall that, a basic
property we expect from any secure message protocol is Forward Secrecy. That is
all previously delivered messages should remain private regardless of future (cryptographic)
state leakage. Intuitively, *if the adversary couldn’t already decrypt the
message, then compromising the receiver (after they received the message) shouldn’t
help the adversary do so.* In the above example, we could reasonably hope
for FS for messages sent during epoch j. After all, we established that the
adversary cannot recover the epoch secret so without further compromise all encryption
keys used in that epoch remain unknown to the adversary.

However, the FS security notion does allow the adversary to compromise other group members after they have received a target ciphertext sent during epoch j. Now, the symmetric key schedule for epoch j (rooted at it’s epoch secret) ensures that, as soon as a ciphertext is decrypted the corresponding decryption key is immediately deleted and can no longer be derived from the receiver’s state. So, at first glance, it looks like we do indeed have FS for epoch j.

But that’s not the
whole story. Another way to reproduce all decryption keys for epoch j is for
the adversary to learn the update secret distributed in Alice’s update packet.
Combined with the previous epoch’s init secret (which the adversary already
knows) this allows the adversary to recompute the epoch secret and thus the *entire*
symmetric key schedule for epoch j! How can the adversary learn the update secret?
Well, if they manage to leak a state that includes an asymmetric decryption key
which can process Alice’s update packet (i.e. decrypt one of its ciphertexts) then
the adversary could do just that. In other words, if ever the adversary learns such
an asymmetric decryption key then privacy of (all application messages sent
during) epoch j is retroactively lost.

We’ll call any such asymmetric
decryption key “critical” for epoch j if knowing the key allows the adversary
to recover the epoch secret for epoch j. In particular, *FS for epoch j can only
kick in once all critical keys have been deleted from all group members states.*

**Too Many Critical Keys
& Too Slowly Refreshed.** Put
succinctly, our analysis of MLS showed that that there are really a surprisingly
large number of critical keys lying around in people’s states. Worse, under
normal conditions it will likely take an unpleasantly long time before they are
all replaced in the group state (and thus deleted from everyone’s local states).
What this amounts to, is that once a single compromise has occurred, an MLS session
could well take an unreasonably long time to recover FS for subsequent messages
*even though the compromised party has updated*. What’s worse, it’s not
all that difficult for an attacker observing only the *encrypted* network traffic
to determine who still has local copy of the critical keys for any given target
epoch. This means its all the easier for an attacker to spend their, presumably
limited, resources effectively.

Put simply, we observe
the following fact about the *2n* asymmetric keys in the target epoch’s
distributed state. An MLS update packet contains *log(n)* ciphertexts each
encrypted to a different one of those keys. Decrypting any single ciphertext
reveals the update secret of the target epoch. Thus, we have at least *log(n)*
“primary” critical keys.

Consider the primary
critical key. It turns out that all (but one) of those keys was distributed to
other group members as part of some previous update packet. What’s more, some
of the keys that could be used to process that previous update packet also allow
for recovering the primary critical key distributed in that upate. We’ll call
such keys “secondary” critical keys for epoch j because learning them also permits
the adversary to recover the update secret for epoch j by first recovering a
primary critical key. And this process continues. Most of the “secondary”
critical keys were themselves distributed via even early updates. Some of the keys
that can process those updates also allow recovering the secondary critical key
distributed in those updates making these “tertiary” critical keys. It turns
out that this recursion can be repeated up to *log(n)* making for a total
of roughly *n* critical keys in the distributed group state during epoch j.

The issue is compounded by the fact that most of the asymmetric keys are known to only a very small subset of group members. In other words only very members have the ability to trigger a key refresh. That’s because the only mechanism MLS provides to replace keys is via and updates by a group member that knows the corresponding secret key. To see why this is a problem not that, *half* the keys are each only known by a *single* party. A further subset of *n/2* keys is each known only to 2 parties. Another 8^{th} of keys are each known to 4 parties, etc. So, if a critical key ends up in one of these subsets (which most do) then we can expect the critical key to remain part of the group state across many (at least *O(n)*) epochs! Hardly an ideal scenario in terms of achieving FS for our target epoch.

### 3. A Second Key-Update Mechanism: Updateable Public Key Encryption.

To help get faster FS, we propose a modification to the MLS protocol which introduces a second mechanism through which the asymmetric keys are refreshed. As a critical building block, we use a new type of asymmetric encryption which we call *Updateable Public Key Encryption* (UPKE). It’s tightly related to the similar concept in this paper on the double-ratchet paradigm.

Basically, UPKE is syntactically similar to standard PKE (like the HPKE scheme currently used by MLS) except that encryption/decryption also re-randomizes the PK/SK, respectively. In more detail, just as in standard PKE, the encryption algorithm takes as input a public key and a message. However, along with the usual ciphertext it also outputs an “updated” version of the public key (meant to replace the previous version of the key). Similarly, the decryption algorithm outputs not just the message but also the “updated” secret key. UPKE guarantees that this key update mechanism is homomorphic in the sense that as long as the original PK and SK formed a valid key pair then so do the updated versions. In an upcoming blog post I’ll talk more about this primitive, covering things like how it can be implemented based on various popular elliptic curve groups (e.g. P-251 and X25519) based on standard libraries (e.g. libsodium, OpenSSL, etc.) and what other practical applications it has outside of MLS such as for hierarchical cryptocurrency wallets and in Tor. For now though, I’ll stick to how we use it to improve on MLS.

Basically, the idea is simply to replace standard public key encryption (specifically HPKE) with a UPKE. (More practically speaking we actually propose an updateable version of HPKE.) The key observation here is that this way, as soon as an update packet is received the simple act of processing it automatically results in replacing the primary critical key that was just used. Clearly, this already helps us because at least that one critical key has now been removed from the group state without the need for anyone to perform another update.

In fact, it actually goes
a long way towards improving things. So much so, that we could show that no protocol
can recover FS faster in a setting where all parties receive update packets in
the same order (though not necessarily at the same time). This is a particular interesting
scenario because MLS already requires the delivery service to ensure just such a
global ordering on updates as its also required to keep group members local
states coherent with each other. (It’s worth noting though that the global
ordering assumption is not required for the security of MLS, at least beyond
mounting DOS attacks.) In other words, we prove that under this global ordering
assumption the new MLS variant is *optimal* in terms of how fast it
recovers FS.

To get an idea for why note that asymmetric keys are distributed in such a way by MLS that for any given update each member has (at most) a single primary critical key with which they can process that update. What about the secondary critical keys? Well this invariant applies recursively to the update that distributed the primary recursive key. That is that party had (at most) a single secondary critical key. What’s more, it was immediately deleted by the UPKE decryption process when that previous update was processed. The same argument holds for the tertiary and all other critical keys. So basically, at the start of each epoch we already see that the only critical key (namely a primary one) is left in the group state and it is removed as soon as the key holders process the update starting the epoch.

### 4. Where to Find More About This

A great place to find lots more about this project is on its Project Page over at Wickr’s Crypto Research website. I also gave a talk about these and related results at Real World Crypto 2020 which took place in NYC between January 8-10. You can find the slides for that talk here.

If you’d like to know more about the hardcore technical details of this research, UPKE and our new variant of MLS, I suggest having a look at the full version of our writeup over on eprint. In it, you’ll also find a bunch of other results around MLS and the new construction not discussed here. Things like new formal security definitions and proofs for the core part of MLS – called TreeKEM – which encompasses the update mechanism and the management of the asymmetric keys in the group state in MLS.

Finally, keep an eye out for my upcoming (and past) blog posts on UPKE, MLS and other crypto related topics.