Opening on Kegan's behalf.
[MSC4284](https://github.com/matrix-org/matrix-spec-proposals/pull/4284)
has already been opened accordingly.
---------
Co-authored-by: Kegan Dougal <7190048+kegsay@users.noreply.github.com>
Co-authored-by: Eric Eastwood <erice@element.io>
Spawning from
https://github.com/matrix-org/synapse/pull/12588#discussion_r865843321
> It turns out `Deferred.cancel()` is a lot like
`Deferred.callback()`/`errback()` in that it will trash the logging
context:
> it can resume a coroutine, which will restore its own logging context,
then run:
>
> - until it blocks, setting the sentinel context
> - or until it terminates, setting the context it was started with
>
> So we need to wrap it in `with PreserveLoggingContext():`, like we do
with `.callback()`:
>
> ```python
> with PreserveLoggingContext():
>     self.render_deferred.cancel()
> ```
>
> *-- @squahtx,
https://github.com/matrix-org/synapse/pull/12588#discussion_r865843321*
Regressed in
https://github.com/element-hq/synapse/pull/18900#discussion_r2331554278
(see conversation there for more context)
### How is this a regression?
> To give this an update with more hindsight: this logic *was* redundant
with the early return and it is safe to remove this complexity ✅
>
> It seems like this actually has to do with completed vs incomplete
deferreds...
>
> To explain how things previously worked *without* the early-return
shortcut:
>
> With the normal case of **incomplete awaitable**, we store the
`calling_context` and the `f` function is called and runs until it
yields to the reactor. Because `f` follows the logcontext rules, it sets
the `sentinel` logcontext. Then in `run_in_background(...)`, we restore
the `calling_context`, store the current `ctx` (which is `sentinel`) and
return. When the deferred completes, we restore `ctx` (which is
`sentinel`) before yielding to the reactor again (all good
✅)
>
> With the other case where we see a **completed awaitable**, we store
the `calling_context` and the `f` function is called and runs to
completion (no logcontext change). *This is where the shortcut would
kick in but I'm going to continue explaining as if we commented out the
shortcut.* -- Then in `run_in_background(...)`, we restore the
`calling_context`, store the current `ctx` (which is same as the
`calling_context`). Because the deferred is already completed, our extra
callback is called immediately and we restore `ctx` (which is same as
the `calling_context`). Since we never yield to the reactor, the
`calling_context` is perfect as that's what we want again (all good
✅)
>
> ---
>
> But this also means that our early-return shortcut is no longer just
an optimization and is *necessary* to act correctly in the **completed
awaitable** case as we want to return with the `calling_context` and not
reset to the `sentinel` context. I've updated the comment in
https://github.com/element-hq/synapse/pull/18964 to explain the
necessity as it's currently just described as an optimization.
>
> But because we made the same change to
`run_coroutine_in_background(...)`, which didn't have the same
early-return shortcut, we regressed the correct behavior ❌. This is
being fixed in https://github.com/element-hq/synapse/pull/18964
>
>
> *-- @MadLittleMods,
https://github.com/element-hq/synapse/pull/18900#discussion_r2373582917*
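To make the two code paths concrete, here is a condensed sketch of the mechanics described above for a coroutine-returning `f` (a paraphrase, not the exact Synapse source; `current_context`/`set_current_context` stand in for Synapse's real logcontext helpers):

```python
from twisted.internet.defer import ensureDeferred


def _set_context_cb(result, context):
    # Restore `context` when the deferred fires, passing the result through.
    set_current_context(context)
    return result


def run_in_background(f, *args, **kwargs):
    calling_context = current_context()
    d = ensureDeferred(f(*args, **kwargs))

    if d.called and not d.paused:
        # Completed awaitable: `f` ran to completion without yielding to the
        # reactor, so we are still in `calling_context`. Returning here is
        # *necessary*, not just an optimization: we must hand control back
        # in `calling_context`, not reset to the sentinel context.
        return d

    # Incomplete awaitable: `f` blocked and set the sentinel context.
    # Restore `calling_context` for our caller, remembering the previous
    # context (`sentinel`) so the deferred can reinstate it just before
    # yielding back to the reactor when it completes.
    ctx = set_current_context(calling_context)
    d.addBoth(_set_context_cb, ctx)
    return d
```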
### How did we find this problem?
Spawning from @wrjlewis
[seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io)
`Starting metrics collection 'typing.get_new_events' from sentinel
context: metrics will be lost` in the logs:
<details>
<summary>More logs</summary>

```
synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context.
2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost
2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response.
2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already disconnected.
2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/
```
</details>
From the logs there, we can see things relating to
`typing.get_new_events` and
`/_matrix/client/unstable/org.matrix.simplified_msc3575/sync`, which led
me to try out Sliding Sync with the typing extension enabled and allowed
me to reproduce the problem locally. Sliding Sync is a unique scenario
as it's the only place we use `gather_optional_coroutines(...)` ->
`run_coroutine_in_background(...)` (introduced in
https://github.com/element-hq/synapse/pull/17884), which is what
exhibits this behavior.
### Testing strategy
1. Configure Synapse to enable
[MSC4186](https://github.com/matrix-org/matrix-spec-proposals/pull/4186):
Simplified Sliding Sync, which is gated behind the
[MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575)
experimental flag
```yaml
experimental_features:
msc3575_enabled: true
```
1. Start Synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Make a Sliding Sync request with one of the extensions enabled
```http
POST http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync
{
  "lists": {},
  "room_subscriptions": {
    "!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": {
      "required_state": [],
      "timeline_limit": 1
    }
  },
  "extensions": {
    "typing": {
      "enabled": true
    }
  }
}
```
1. Open your homeserver logs and notice warnings about `Starting ...
from sentinel context: metrics will be lost`
Part of https://github.com/element-hq/synapse/issues/18905
Lints ensuring we use `Clock.call_later` instead of
`reactor.callLater`, etc. are coming in
https://github.com/element-hq/synapse/pull/18944
### Testing strategy
1. Configure Synapse to log at the `DEBUG` level
1. Start Synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Wait 10 seconds for the [database profiling
loop](9cc4001778/synapse/storage/database.py (L711))
to execute
1. Notice the logcontext being used for the `Total database time` log
line
Before (`sentinel`):
```
2025-09-10 16:36:58,651 - synapse.storage.TIME - 707 - DEBUG - sentinel - Total database time: 0.646% {room_forgetter_stream_pos(2): 0.131%, reap_monthly_active_users(1): 0.083%, get_device_change_last_converted_pos(1): 0.078%}
```
After (`looping_call`):
```
2025-09-10 16:36:58,651 - synapse.storage.TIME - 707 - DEBUG - looping_call - Total database time: 0.646% {room_forgetter_stream_pos(2): 0.131%, reap_monthly_active_users(1): 0.083%, get_device_change_last_converted_pos(1): 0.078%}
```
Introduce `Clock.call_when_running(...)` to wrap startup code in a
logcontext, ensuring we can identify which server generated the logs.
Background:
> Ideally, nothing from the Synapse homeserver would be logged against the `sentinel`
> logcontext as we want to know which server the logs came from. In practice, this is not
> always the case yet especially outside of request handling.
>
> Global things outside of Synapse (e.g. Twisted reactor code) should run in the
> `sentinel` logcontext. It's only when it calls into application code that a logcontext
> gets activated. This means the reactor should be started in the `sentinel` logcontext,
> and any time an awaitable yields control back to the reactor, it should reset the
> logcontext to be the `sentinel` logcontext. This is important to avoid leaking the
> current logcontext to the reactor (which would then get picked up and associated with
> the next thing the reactor does).
>
> *-- `docs/log_contexts.md`*
Also adds a lint to prefer `Clock.call_when_running(...)` over
`reactor.callWhenRunning(...)`
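A minimal sketch (assumed names, not the exact implementation) of what the wrapper adds over calling `reactor.callWhenRunning(...)` directly: the startup callback runs inside a named logcontext rather than the `sentinel` one.

```python
def call_when_running(self, f, *args, **kwargs):
    def wrapper():
        # Activate a logcontext so startup logs carry our server's name,
        # instead of being attributed to the `sentinel` context.
        with LoggingContext(name="call_when_running"):
            f(*args, **kwargs)

    self._reactor.callWhenRunning(wrapper)
```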
Part of https://github.com/element-hq/synapse/issues/18905
Closes: #18436
Implements:
https://github.com/matrix-org/matrix-spec-proposals/pull/4308
Follows: #18674
Adds an extension to Sliding Sync and a companion endpoint needed for
backpaginating missed thread subscription changes, as described in
MSC4308.
---------
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Spawning from https://github.com/element-hq/synapse/pull/18871
[This change](6ce2f3e59d)
was originally used to fix CPU time going backwards when we `daemonize`.
While we don't seem to run into this problem on `develop`, I still
think this is a good change to make. We don't need background tasks
running on a process that will soon be forcefully exited and where the
reactor isn't even running yet. We now kick off the background tasks
(`run_as_background_process`) after we have forked the process and
started the reactor.
Also, as a simple note, we don't need background tasks running in both halves of a fork.
Introduced in: https://github.com/element-hq/synapse/pull/17167
The endpoint was part of experiments for MSC3575 but does not feature in
that MSC.
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Switch to OpenTracing's `ContextVarsScopeManager` instead of our own
custom `LogContextScopeManager`.
This is now possible because the linked Twisted issue from the comment
in our custom `LogContextScopeManager` is resolved:
https://twistedmatrix.com/trac/ticket/10301
This PR is spawning from exploring different possibilities to solve the
`scope` loss problem I was encountering in
https://github.com/element-hq/synapse/pull/18804#discussion_r2268254424.
This appears to solve the problem and I've added the additional test
from there to this PR ✅
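For illustration, the swap might look roughly like this when building the Jaeger tracer (a sketch assuming `jaeger_client`'s `Config`, which accepts a `scope_manager` argument):

```python
from jaeger_client import Config
from opentracing.scope_managers.contextvars import ContextVarsScopeManager

config = Config(
    config={"sampler": {"type": "const", "param": 1}},
    service_name="synapse",
    # Previously: scope_manager=LogContextScopeManager()
    scope_manager=ContextVarsScopeManager(),
)
tracer = config.initialize_tracer()
```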
Spawning from wanting to confirm my replies in
https://github.com/element-hq/synapse/issues/18489
We're now using the same source of truth for the list of tables being
purged in the tests. For example, we weren't previously testing that
`local_current_membership` was cleared out, because the lists were out
of sync.
This PR reverts https://github.com/element-hq/synapse/pull/18751
### Why revert?
@reivilibre
[found](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$u9OEmMxaFYUzWHhCk1A_r50Y0aGrtKEhepF7WxWJkUA?via=matrix.org&via=node.marinchik.ink&via=element.io)
that our CI was failing in bizarre ways (thanks for stepping up to dive
into this 🙇). Examples:
- `twisted.internet.error.ProcessTerminated: A process has ended with a
probable error condition: process ended by signal 9.`
- `twisted.internet.error.ProcessTerminated: A process has ended with a
probable error condition: process ended by signal 15.`
<details>
<summary>More detailed part of the log</summary>

https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809
```
tests.util.test_wheel_timer.WheelTimerTestCase.test_single_insert_fetch
===============================================================================
Error:
Traceback (most recent call last):
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/disttrial.py", line 371, in task
await worker.run(case, result)
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 305, in run
return await self.callRemote(workercommands.Run, testCase=testCaseId) # type: ignore[no-any-return]
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
yield self
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/protocols/amp.py", line 1968, in _massageError
error.trap(RemoteAmpError)
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 431, in trap
self.raiseException()
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 455, in raiseException
raise self.value.with_traceback(self.tb)
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.
tests.util.test_macaroons.MacaroonGeneratorTestCase.test_guest_access_token
-------------------------------------------------------------------------------
Ran 4325 tests in 669.321s
FAILED (skips=159, errors=62, successes=4108)
while calling from thread
Traceback (most recent call last):
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 1064, in runUntilCurrent
f(*a, **kw)
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 790, in stop
raise error.ReactorNotRunning("Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.
joining disttrial worker #0 failed
Traceback (most recent call last):
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1853, in _inlineCallbacks
result = context.run(
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
return g.throw(self.value.with_traceback(self.tb))
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 406, in exit
await endDeferred
File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
yield self
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.
```
</details>
With more debugging (thanks @devonh for also stepping in as maintainer),
we found that the CI was consistently failing at
`test_exposed_to_prometheus`, which was a bit of a smoke signal given
all of the [metrics
changes](https://github.com/element-hq/synapse/issues/18592) that were
merged recently.
Locally, although I wasn't able to reproduce the bizarre errors, I could
easily see increased memory usage (~20GB vs ~2GB) and the
`test_exposed_to_prometheus` test taking a while to complete when
running a full test run (`SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial
tests`).
<img width="1485" height="78" alt="Lots of memory usage"
src="https://github.com/user-attachments/assets/811e2a96-75e5-4a3c-966c-00dc0512cea9"
/>
After updating `test_exposed_to_prometheus` to dump
`latest_metrics_response = generate_latest(REGISTRY)`, I could see that
it's a massive 3.2GB response. Inspecting the contents, we can see 4.1M
(4,137,123) entries for just
`synapse_background_update_status{server_name="test"} 3.0`, which is a
`LaterGauge`. We don't have 4.1M test cases, so it's unclear why we end
up with quite so many samples, but it does make sense that we see a lot
of duplicates: each `HomeserverTestCase` creates a homeserver per test
case, and each homeserver calls `LaterGauge.register_hook(...)` (part of
the https://github.com/element-hq/synapse/pull/18751 changes).
`tests/storage/databases/main/test_metrics.py`
```python
from prometheus_client import REGISTRY, generate_latest

latest_metrics_response = generate_latest(REGISTRY)
with open("/tmp/synapse-test-metrics", "wb") as f:
    f.write(latest_metrics_response)
```
After reverting the https://github.com/element-hq/synapse/pull/18751
changes, running the full test suite locally doesn't result in memory
spikes and seems to run normally.
### Dev notes
Discussion in the
[`#synapse-dev:matrix.org`](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$vkMATs04yqZggVVd6Noop5nU8M2DVoTkrAWshw7u1-w?via=matrix.org&via=node.marinchik.ink&via=element.io)
room.
Fix `LaterGauge` metrics to collect from all servers
Follow-up to https://github.com/element-hq/synapse/pull/18714
Previously, our `LaterGauge` metrics did include the `server_name` label
as expected but we were only seeing the last server being reported in
some cases. Any `LaterGauge` that we were creating multiple times was
only reporting the last instance.
This PR updates all `LaterGauge`s to be created once, and we then use
`LaterGauge.register_hook(...)` to add in the metric callback as before.
This works now because we store a list of callbacks instead of just one.
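As a rough sketch of the shape of this (names and signatures assumed, not the exact API): the gauge is created once at module level, and each homeserver appends its own hook.

```python
# Created exactly once, at module load.
background_update_status = LaterGauge(
    "synapse_background_update_status",
    "Background update status",
    labelnames=[SERVER_NAME_LABEL],
)

# Called once per homeserver; hooks accumulate in a list rather than
# replacing each other, so every server's value is reported at scrape time.
background_update_status.register_hook(
    lambda: {("my.server.name",): 3.0}  # label values -> gauge value
)
```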
I noticed this problem thanks to some [tests in the Synapse Pro for
Small Hosts](https://github.com/element-hq/synapse-small-hosts/pull/173)
repo that sanity check all metrics to ensure that we can see each metric
includes data from multiple servers.
### Testing strategy
1. This is only noticeable when you run multiple Synapse instances in
the same process.
1. TODO
(see test that was added)
### Dev notes
Previous non-global `LaterGauge`:
```
synapse_federation_send_queue_xxx
synapse_federation_transaction_queue_pending_destinations
synapse_federation_transaction_queue_pending_pdus
synapse_federation_transaction_queue_pending_edus
synapse_handlers_presence_user_to_current_state_size
synapse_handlers_presence_wheel_timer_size
synapse_notifier_listeners
synapse_notifier_rooms
synapse_notifier_users
synapse_replication_tcp_resource_total_connections
synapse_replication_tcp_command_queue
synapse_background_update_status
synapse_federation_known_servers
synapse_scheduler_running_tasks
```
This can be reviewed commit by commit
There are a few improvements over the experimental support:
- authorisation of Synapse <-> MAS requests is simplified, with a single
shared secret, removing the need for provisioning a client on the MAS
side
- the tests actually spawn a real server, allowing us to test the Rust
introspection layer
- we now check that the device advertised in introspection actually
exists, making it so that when a user logs out, the tokens are
immediately invalidated, even if the cache doesn't expire
- it doesn't rely on discovery anymore, but rather on a static endpoint
base. This means users don't have to override the introspection endpoint
to avoid internet roundtrips
- it doesn't depend on `authlib` anymore, as we have significantly
simplified the calls made from Synapse to MAS
We still have to update the MAS documentation about the Synapse setup,
but that can be done later.
---------
Co-authored-by: reivilibre <oliverw@element.io>
We do this by a) not pulling out all membership events, and b) batch
inserting bans.
One blocking concern is that this bypasses the `update_membership`
function, which otherwise all other membership events go via. In this
case it's fine (having audited what it is doing), but I'm hesitant to
set the precedent of bypassing it, given it has a lot of logic in there.
---------
Co-authored-by: Eric Eastwood <erice@element.io>
Follow-up to https://github.com/element-hq/synapse/pull/18604
Previously, our cache metrics did include the `server_name` label as
expected but we were only seeing the last server being reported. This
was because we would
`CACHE_METRIC_REGISTRY.register_hook(metric_name, metric.collect)` where
the `metric_name` only took into account the cache name, so the hook
would be overwritten every time we spawned a new server.
This PR updates the registration logic to include the `server_name` so
we have a hook for every cache on every server as expected.
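Illustratively (the exact key format is an assumption), the hook key now incorporates both names, so a second homeserver no longer clobbers the first:

```python
# Before: keyed on the cache name alone, so each new homeserver's
# registration overwrote the previous one.
CACHE_METRIC_REGISTRY.register_hook(metric_name, metric.collect)

# After (sketch): keyed on both server and cache, giving one hook per
# cache per server.
CACHE_METRIC_REGISTRY.register_hook(f"{server_name}-{metric_name}", metric.collect)
```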
I noticed this problem thanks to some [tests in the Synapse Pro for
Small Hosts](https://github.com/element-hq/synapse-small-hosts/pull/173)
repo that sanity check all metrics to ensure that we can see each metric
includes data from multiple servers.
Bulk refactor `Histogram` metrics to be homeserver-scoped. We also add
lints to make sure that new `Histogram` metrics don't sneak in without
using the `server_name` label (`SERVER_NAME_LABEL`).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the TODO metrics with the `server_name`
label
### Todo
- [x] Wait for https://github.com/element-hq/synapse/pull/18656 to merge
### Dev notes
```
LoggingDatabaseConnection
make_conn
make_pool
make_fake_db_pool
```
Refactor `GaugeBucketCollector` metrics to be homeserver-scoped
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Adjust the number of [`msecs` in the `looping_call` so that
`_read_forward_extremities`](a82b8a966a/synapse/storage/databases/main/metrics.py (L79))
runs immediately instead of after an hour.
1. Observe response includes the `synapse_forward_extremities` and
`synapse_excess_extremity_events` metrics with the `server_name` label
Bulk refactor `Gauge` metrics to be homeserver-scoped. We also add lints
to make sure that new `Gauge` metrics don't sneak in without using the
`server_name` label (`SERVER_NAME_LABEL`).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the TODO metrics with the `server_name`
label
Bulk refactor `Counter` metrics to be homeserver-scoped. We also add
lints to make sure that new `Counter` metrics don't sneak in without
using the `server_name` label (`SERVER_NAME_LABEL`).
All of the "Fill in" commits are just bulk refactor.
Part of https://github.com/element-hq/synapse/issues/18592
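The pattern itself is plain `prometheus_client` usage; a minimal sketch (metric name taken from the testing steps below; `SERVER_NAME_LABEL` is Synapse's constant for the label name, shown here with an assumed value):

```python
from prometheus_client import Counter

SERVER_NAME_LABEL = "server_name"

user_registrations = Counter(
    "synapse_user_registrations",
    "Number of users registered on this homeserver",
    labelnames=[SERVER_NAME_LABEL],
)

# Exported as `synapse_user_registrations_total{server_name="..."}`.
user_registrations.labels(server_name="my.server.name").inc()
```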
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the `synapse_user_registrations_total`,
`synapse_http_server_response_count_total`, etc metrics with the
`server_name` label
Default values will be 1 room per minute, with a burst count of 10.
It's hard to imagine most users will be affected by this default rate,
but it's intentionally non-invasive in case of bots or other users that
need to create rooms at a high rate.
Server admins might want to down-tune this on their deployments.
---------
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Part of https://github.com/element-hq/synapse/issues/18592
Separated out of https://github.com/element-hq/synapse/pull/18656
because it's a bigger, unique piece of the refactor
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the background process metrics
(`synapse_background_process_start_count`,
`synapse_background_process_db_txn_count_total`, etc.) with the
`server_name` label
Spawning from getting `HMAC incorrect` errors that seem unexplainable
except for the `registration_shared_secret` being misconfigured. It's
also possible my HMAC calculation is incorrect but every time I
double-check the result with the [known-good Python
example](553e124f76/docs/admin_api/register_api.md)
(which matches [Synapse's
source](24e849e483/synapse/rest/admin/users.py (L618-L633))),
it's as expected.
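For reference, the HMAC from that known-good example is roughly the following (HMAC-SHA1 over NUL-separated fields):

```python
import hashlib
import hmac


def generate_mac(
    shared_secret: str, nonce: str, user: str, password: str, admin: bool = False
) -> str:
    mac = hmac.new(
        key=shared_secret.encode("utf8"),
        digestmod=hashlib.sha1,
    )
    mac.update(nonce.encode("utf8"))
    mac.update(b"\x00")
    mac.update(user.encode("utf8"))
    mac.update(b"\x00")
    mac.update(password.encode("utf8"))
    mac.update(b"\x00")
    mac.update(b"admin" if admin else b"notadmin")
    return mac.hexdigest()
```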
With these logs, we can actually debug whether
`registration_shared_secret` is being configured correctly or not.
It also helps specifically when using `registration_shared_secret_path`
since the default Synapse behavior (of creating the file and secret if
it doesn't exist) can mask a deployment race condition where we would
start up Synapse before the `registration_shared_secret_path` file was
put in place:
> **`registration_shared_secret_path`**
>
> [...]
>
> If this file does not exist, Synapse will create a new shared secret
on startup and store it in this file.
>
> *-- [Synapse config
docs](6521406a37/docs/usage/configuration/config_documentation.md (registration_shared_secret_path))*
This only applies to the [`POST
/_synapse/admin/v1/register`](553e124f76/docs/admin_api/register_api.md)
endpoint, but it does log very sensitive information, so we've made it
opt-in: you have to explicitly enable the logs by configuring the
`synapse.rest.admin.users.registration_debug` logger (which does not
inherit the root log level, via our new `ExplicitlyConfiguredLogger`):
`homeserver.yaml`
```yaml
log_config: "/myserver.log.config.yaml"
```
`myserver.log.config.yaml`
```yaml
version: 1
formatters:
precise:
format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s - %(message)s'
handlers:
# ... file/buffer handler (see `sample_log_config.yaml`)
# A handler that writes logs to stderr. Unused by default, but can be used
# instead of "buffer" and "file" in the logger handlers.
console:
class: logging.StreamHandler
formatter: precise
loggers:
synapse.storage.SQL:
# beware: increasing this to DEBUG will make synapse log sensitive
# information such as access tokens.
level: INFO
# Has to be explicitly configured as such. Will not inherit from the root level even if it's set to DEBUG
synapse.rest.admin.users.registration_debug:
level: DEBUG
root:
level: INFO
handlers: [console]
disable_existing_loggers: false
```
This introduces a dedicated API for MAS to consume. Companion PR on the
MAS side: element-hq/matrix-authentication-service#4801
This has a few advantages over the previous admin API:
- it works on workers (this will be documented once we stabilise MSC3861
as a whole)
- it is more efficient because it is more focused
- it propagates trace contexts from MAS
- it is only accessible to MAS (through the shared secret) and will let
us remove the weird hack that made this token 'admin' with a ghost
'@__oidc_admin:' user
The next MAS version should support it, but it will be opt-in. The
version after that should use this new API by default.
---------
Co-authored-by: Eric Eastwood <erice@element.io>
The main goal of this PR is to move handling of device list changes onto
multiple writers, off the main process, so that logins can happen whilst
Synapse is rolling-restarting.
This is quite an intrusive change, so I would advise reviewing it commit
by commit; I tried to keep the history as clean as possible.
There are a few things to consider:
- the `device_list_key` in stream tokens becomes a
`MultiWriterStreamToken`, which has a few implications in sync and on
the storage layer
- we had a split between `DeviceHandler` and `DeviceWorkerHandler` for
the master vs. worker process. I've kept this split, but made it writer
vs. non-writer worker instead, using method overrides to do replication
calls when needed
- there are a few operations that need to happen on a single worker at a
time. Instead of using cross-worker locks, for now I made them run on
the first writer on the list
---------
Co-authored-by: Eric Eastwood <erice@element.io>
Clean up `MetricsResource`, Prometheus hacks
(`_set_prometheus_client_use_created_metrics`), and better document why
we care about having a separate `metrics` listener type.
These clean-up changes have been split out from
https://github.com/element-hq/synapse/pull/18584 since that PR was
closed.
Refactor `Measure` block metrics to be homeserver-scoped (add
`server_name` label to block metrics).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
#### See behavior of previous `metrics` listener
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9323/metrics`
1. Observe response includes the block metrics
(`synapse_util_metrics_block_count`,
`synapse_util_metrics_block_in_flight`, etc)
#### See behavior of the `http` `metrics` resource
1. Add the `metrics` resource to a new or existing `http` listeners in
your `homeserver.yaml`
```yaml
listeners:
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` (it's just a `GET`
request so you can even do it in the browser)
1. Observe response includes the block metrics
(`synapse_util_metrics_block_count`,
`synapse_util_metrics_block_in_flight`, etc)
Another config option on my quest to add a `*_path` variant for every
secret. Adds the config options `recaptcha_private_key_path` and
`recaptcha_public_key_path`. Tests and docs are included.
A public key is of course no secret, but it is closely related to the
private key, so it’s still useful to have a `*_path` variant for it.
You can now configure how much media can be uploaded by a user in a
given time period.
Note: the first commit here is a refactor of the create/upload content
function.
This implements
https://github.com/matrix-org/matrix-spec-proposals/pull/3765 which is
already merged and, therefore, can use stable identifiers.
For `/publicRooms` and `/hierarchy`, the topic is read from the
eponymous field of the `current_state_events` table. Rather than
introduce further columns in this table, I changed the insertion /
update logic to write the plain-text topic from the rich topic into the
existing field. This will not take effect for existing rooms unless
their topic is changed. However, existing rooms shouldn't have rich
topics to begin with.
Similarly, for server-side search, I changed the insertion logic of the
`event_search` table to prefer the value from the rich topic. Again,
existing events shouldn't have rich topics and, therefore, don't need to
be migrated in the table.
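For illustration, extracting the plain-text representation from a rich topic might look like this (a sketch assuming the MSC3765 content shape, where `m.topic` carries an `m.text` array of representations):

```python
from typing import Any, Dict, Optional


def plain_text_topic(content: Dict[str, Any]) -> Optional[str]:
    # Prefer the text/plain representation from the rich topic, if any.
    for representation in content.get("m.topic", {}).get("m.text", []):
        # Entries without an explicit mimetype default to text/plain.
        if representation.get("mimetype", "text/plain") == "text/plain":
            return representation.get("body")
    # Fall back to the legacy plain-text field.
    return content.get("topic")
```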
Spec doc: https://spec.matrix.org/v1.15/client-server-api/#mroomtopic
Part of supporting Matrix v1.15:
https://spec.matrix.org/v1.15/client-server-api/#mroomtopic
Signed-off-by: Johannes Marbach <n0-0ne+github@mailbox.org>
Co-authored-by: Eric Eastwood <erice@element.io>
When a request gets ratelimited we (optionally) wait ~500ms before
returning to mitigate clients that like to tightloop on request
failures. However, this is currently implemented by pausing request
processing when we check for ratelimits, which might be deep within
request processing, and e.g. while locks are held. Instead, let's hoist
the pause to the very top of the HTTP handler.
Hopefully, this mitigates the issue where a user sending lots of events
to a single room can see their requests time out due to the combination
of the linearizer and the pausing of the request. Instead, they should
see the requests 429 after ~500ms.
The first commit is a refactor to pass the `Clock` to `AsyncResource`,
the second commit is the behavioural change.
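A simplified sketch of the behavioural change (names assumed; not the actual handler code): the ratelimit check and pause now happen before any request processing, so no locks or linearizers are held while we wait.

```python
async def _async_render(self, request):
    try:
        # Check the ratelimit up front, before request processing begins.
        await self.ratelimiter.ratelimit(request)
    except LimitExceededError:
        # Pause ~500ms to damp clients that tight-loop on failures, then 429.
        await self.clock.sleep(0.5)
        respond_with_json(request, 429, {"errcode": "M_LIMIT_EXCEEDED"})
        return

    # Normal request processing (may take locks, enter linearizers, etc.).
    await self._dispatch(request)
```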
We do this by shoving it into Rust. We believe our Python HTTP client is
a bit slow.
Also bumps the minimum Rust version to 1.81.0, released last September
(over six months ago).
To allow for async Rust, includes some adapters between Tokio in Rust
and the Twisted reactor in Python.
This was correctly handled for the "fallback" case where the background
updates hadn't finished.
---------
Co-authored-by: Eric Eastwood <erice@element.io>
This can be reviewed commit by commit.
This enables the `flake8-logging` and `flake8-logging-format` rules in
Ruff, as well as logging exception stack traces in a few places where it
makes sense:
- https://docs.astral.sh/ruff/rules/#flake8-logging-log
- https://docs.astral.sh/ruff/rules/#flake8-logging-format-g
### Linting to avoid pre-formatting log messages
See [`adamchainz/flake8-logging` -> *LOG011 avoid pre-formatting log
messages*](152db2f167/README.rst (log011-avoid-pre-formatting-log-messages))
Practically, this means prefer placeholders (`%s`) over f-strings for
logging.
This is because placeholders are passed as args to loggers, so they can
do special handling of them.
For example, Sentry will record the args separately in their logging
integration:
c15b390dfe/sentry_sdk/integrations/logging.py (L280-L284)
One small theoretical perf benefit is that log levels that aren't
enabled won't get formatted, so we don't unnecessarily create
formatted strings.
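Concretely, the preferred style looks like this:

```python
import logging

logger = logging.getLogger(__name__)

user_id = "@alice:example.com"

# Discouraged (LOG011): the string is formatted eagerly, even when INFO is
# disabled, and integrations only ever see the final opaque message.
logger.info(f"Registered user {user_id}")

# Preferred: formatting is deferred until a handler needs it, and the args
# are recorded separately (e.g. by Sentry's logging integration).
logger.info("Registered user %s", user_id)
```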
This small PR migrates from `unittest.assertEquals` to
`unittest.assertEqual`; the former has been deprecated since Python 2.7:
```python
DeprecationWarning: Please use assertEqual instead.
```
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Fixes https://github.com/element-hq/synapse/issues/14240
This scratches an itch that I've had for years. We regularly run into
the issue where (especially in development) appservices can go down for
a period and then come back up. The ping endpoint was introduced some
time ago, which means Synapse can determine if an AS is up more or less
immediately, so we might as well use that to schedule transaction
redelivery.
I believe transaction scheduling logic is largely implementation
specific, so we should be in the clear to do this without any spec
changes.
This implements
https://github.com/matrix-org/matrix-spec-proposals/pull/4155, which
adds support for a new account data type that blocks an invite based on
some conditions in the event contents.
---------
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
This PR adds an additional `room_config` argument to the
`user_may_create_room` spam checker module API callback.
It will continue to work with implementations of `user_may_create_room`
that do not expect the additional parameter.
A side effect is that on a room upgrade the spam checker callback is
called *after* doing some work to calculate the state rather than
before. However, I hope that this is acceptable given the relative
infrequency of room upgrades.
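The usual way to stay compatible with older callbacks is to inspect the registered function's signature and only pass the new argument when it's expected; a sketch (assumed here, the module API may differ in detail):

```python
import inspect


async def invoke_user_may_create_room(callback, user_id, room_config):
    # Older modules registered `user_may_create_room(user_id)`; only pass
    # `room_config` to callbacks that declare a second parameter.
    if len(inspect.signature(callback).parameters) >= 2:
        return await callback(user_id, room_config)
    return await callback(user_id)
```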
This PR addresses a test failure for
`tests.handlers.test_worker_lock.WorkerLockTestCase.test_lock_contention`
which consistently times out on the RISC-V (specifically `riscv64`)
architecture.
The test simulates high lock contention and has a default timeout of 5
seconds, which seems sufficient for architectures like x86_64 but proves
too short for current RISC-V hardware/environment performance
characteristics, leading to spurious `tests.utils.TestTimeout` failures.
This fix introduces architecture detection using `platform.machine()`.
If a RISC-V architecture is detected:
* The timeout for this specific test is increased (e.g., to 15 seconds).
The original, stricter timeout (5 seconds) and lock count (500) are
maintained for all other architectures to avoid masking potential
performance regressions elsewhere.
This change has been tested locally on RISC-V, where the test now passes
reliably, and on x86_64, where it continues to pass with the original
constraints.
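The gist of the detection (exact names assumed):

```python
import platform

# RISC-V machines report e.g. "riscv64" from platform.machine(); give them
# a more generous timeout while keeping the strict 5s elsewhere.
TEST_TIMEOUT_S = 15 if platform.machine().startswith("riscv") else 5
```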