synapse/changelog.d
Eric Eastwood ff03a51cb0
Revert "Fix LaterGauge metrics to collect from all servers (#18751)" (#18789)
This PR reverts https://github.com/element-hq/synapse/pull/18751

### Why revert?

@reivilibre
[found](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$u9OEmMxaFYUzWHhCk1A_r50Y0aGrtKEhepF7WxWJkUA?via=matrix.org&via=node.marinchik.ink&via=element.io)
that our CI was failing in bizarre ways (thanks for stepping up to dive
into this 🙇). Examples:

- `twisted.internet.error.ProcessTerminated: A process has ended with a
probable error condition: process ended by signal 9.`
- `twisted.internet.error.ProcessTerminated: A process has ended with a
probable error condition: process ended by signal 15.`
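
For reference, signals 9 and 15 are `SIGKILL` and `SIGTERM` on Linux. A minimal check, assuming a POSIX platform (the names come from Python's standard `signal` module, not from the CI logs):

```python
import signal

# Map the signal numbers reported by Twisted to their symbolic names.
names = {number: signal.Signals(number).name for number in (9, 15)}

for number, name in names.items():
    print(number, name)
```

`SIGKILL` is what the Linux kernel's out-of-memory killer sends, which would be consistent with the memory growth described further down, although the logs themselves don't confirm the cause.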

<details>
<summary>More detailed part of the log</summary>


https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809
```
tests.util.test_wheel_timer.WheelTimerTestCase.test_single_insert_fetch
===============================================================================
Error: 
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/disttrial.py", line 371, in task
    await worker.run(case, result)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 305, in run
    return await self.callRemote(workercommands.Run, testCase=testCaseId)  # type: ignore[no-any-return]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/protocols/amp.py", line 1968, in _massageError
    error.trap(RemoteAmpError)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 431, in trap
    self.raiseException()
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 455, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.

tests.util.test_macaroons.MacaroonGeneratorTestCase.test_guest_access_token
-------------------------------------------------------------------------------
Ran 4325 tests in 669.321s

FAILED (skips=159, errors=62, successes=4108)
while calling from thread
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 1064, in runUntilCurrent
    f(*a, **kw)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 790, in stop
    raise error.ReactorNotRunning("Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.

joining disttrial worker #0 failed
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1853, in _inlineCallbacks
    result = context.run(
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 406, in exit
    await endDeferred
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.
```

</details>


With more debugging (thanks @devonh for also stepping in as maintainer),
we found that the CI was consistently failing at
`test_exposed_to_prometheus`, which pointed suspicion at the [metrics
changes](https://github.com/element-hq/synapse/issues/18592) that were
merged recently.

Locally, although I wasn't able to reproduce the bizarre errors, I could
easily see increased memory usage (~20GB vs ~2GB) and the
`test_exposed_to_prometheus` test taking a while to complete when
running a full test run (`SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial
tests`).

<img width="1485" height="78" alt="Lots of memory usage"
src="https://github.com/user-attachments/assets/811e2a96-75e5-4a3c-966c-00dc0512cea9"
/>

After updating `test_exposed_to_prometheus` to dump
`latest_metrics_response = generate_latest(REGISTRY)`, I could see that
the response is a massive 3.2GB. Inspecting the contents shows 4.1M
(4,137,123) entries for just
`synapse_background_update_status{server_name="test"} 3.0`, which is a
`LaterGauge`. I don't think we have 4.1M test cases, so it's unclear why
we end up with quite so many samples. Seeing a lot of duplicates does
make sense, though: each `HomeserverTestCase` creates a homeserver per
test case, and each homeserver calls `LaterGauge.register_hook(...)`
(part of the https://github.com/element-hq/synapse/pull/18751 changes).

`tests/storage/databases/main/test_metrics.py`
```python
        latest_metrics_response = generate_latest(REGISTRY)
        with open("/tmp/synapse-test-metrics", "wb") as f:
            f.write(latest_metrics_response)
```
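
To inspect a dump like this without loading it all at once, something along these lines should work. This is a rough sketch rather than the exact commands I ran: `count_samples` is a hypothetical helper, and it only assumes the Prometheus text format, where comment lines (`# HELP`, `# TYPE`) start with `#`.

```python
from collections import Counter
from typing import Iterable


def count_samples(lines: Iterable[bytes]) -> Counter:
    """Count how often each sample line appears, skipping comments and blanks."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(b"#"):
            continue
        counts[line] += 1
    return counts


# Small in-memory stand-in for the real dump (made-up data for illustration).
example = [
    b"# HELP synapse_background_update_status Background update status\n",
    b"# TYPE synapse_background_update_status gauge\n",
    b'synapse_background_update_status{server_name="test"} 3.0\n',
    b'synapse_background_update_status{server_name="test"} 3.0\n',
    b'synapse_background_update_status{server_name="test"} 3.0\n',
]
counts = count_samples(example)
print(counts.most_common(1))

# Against the real file, iterate over the open file object line by line:
#   with open("/tmp/synapse-test-metrics", "rb") as f:
#       print(count_samples(f).most_common(10))
```

Because the counter is keyed by the full sample line, heavy duplication keeps memory usage small even when the file itself is several GB.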

After reverting the https://github.com/element-hq/synapse/pull/18751
changes, running the full test suite locally doesn't result in memory
spikes and seems to run normally.
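
The duplicate growth can be illustrated with a toy model. This is not Synapse's actual `LaterGauge` implementation: `ToyLaterGauge` and `run_fake_test_case` are hypothetical names, and the sketch only assumes a process-wide gauge that keeps every registered hook and emits one sample per hook at scrape time.

```python
from typing import Callable, Dict, List


class ToyLaterGauge:
    """Hypothetical stand-in for a gauge whose values are computed at scrape time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._hooks: List[Callable[[], Dict[str, float]]] = []

    def register_hook(self, hook: Callable[[], Dict[str, float]]) -> None:
        # Hooks are only ever appended, never removed.
        self._hooks.append(hook)

    def collect(self) -> List[str]:
        # One sample per (hook, server_name) pair.
        samples = []
        for hook in self._hooks:
            for server_name, value in hook().items():
                samples.append(f'{self.name}{{server_name="{server_name}"}} {value}')
        return samples


# Module-level gauge shared by every "test case" in the process.
gauge = ToyLaterGauge("synapse_background_update_status")


def run_fake_test_case() -> None:
    # Each test case sets up a fresh homeserver named "test" and registers a hook.
    gauge.register_hook(lambda: {"test": 3.0})


for _ in range(5):
    run_fake_test_case()

samples = gauge.collect()
print(len(samples), len(set(samples)))
```

In this model the number of samples grows linearly with the number of test cases while the number of distinct samples stays at one, so it accounts for the duplicates but not for the full 4.1M count.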



### Dev notes

Discussion in the
[`#synapse-dev:matrix.org`](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$vkMATs04yqZggVVd6Noop5nU8M2DVoTkrAWshw7u1-w?via=matrix.org&via=node.marinchik.ink&via=element.io)
room.

### Pull Request Checklist

<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [ ] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
  - Be a short description of your change which makes sense to users.
    "Fixed a bug that prevented receiving messages from other servers."
    instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by
    @github_username." or "Contributed by [Your Name]." to the end of the
    entry.
* [ ] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
2025-08-06 22:14:40 +00:00

| File | Latest commit | Date |
| --- | --- | --- |
| .gitignore | | |
| 18474.misc | Add debug log when HMAC incorrect (#18474) | 2025-07-22 11:09:45 -05:00 |
| 18514.feature | Add configurable rate limiting for the creation of rooms. (#18514) | 2025-07-24 14:08:02 +00:00 |
| 18540.feature | Add support for MSC4293 - Redact on Kick/Ban (#18540) | 2025-07-23 16:00:01 +01:00 |
| 18574.misc | Make room upgrades faster for rooms with many bans (#18574) | 2025-08-04 10:42:52 +01:00 |
| 18580.misc | Use UTF-8 for config doc generation (#18580) | 2025-07-22 17:54:47 +00:00 |
| 18585.feature | Allow admins to see policy server-flagged events (#18585) | 2025-07-29 19:57:33 +01:00 |
| 18656.misc | Refactor Counter metrics to be homeserver-scoped (#18656) | 2025-07-25 14:58:47 -05:00 |
| 18670.misc | Refactor background process metrics to be homeserver-scoped (#18670) | 2025-07-23 13:28:17 -05:00 |
| 18686.feature | Configure HTTP proxy in file config (#18686) | 2025-07-22 10:33:00 -05:00 |
| 18696.bugfix | Allow return code 403 when fetching profile via federation (#18696) | 2025-07-22 18:42:50 +01:00 |
| 18700.doc | Minor improvements to README.rst (#18700) | 2025-07-30 15:07:10 +01:00 |
| 18714.misc | Refactor LaterGauge metrics to be homeserver-scoped (#18714) | 2025-07-29 13:49:41 -05:00 |
| 18715.misc | Refactor GaugeBucketCollector metrics to be homeserver-scoped (#18715) | 2025-07-29 11:46:21 -05:00 |
| 18718.misc | Reduce database usage in Sliding Sync by not querying for background update completion after the update is known to be complete. (#18718) | 2025-07-24 14:58:39 +00:00 |
| 18722.feature | MSC4306: expose feature in the client version (#18722) | 2025-07-29 13:39:11 -05:00 |
| 18723.misc | Improve order of validation and ratelimiting in room creation (#18723) | 2025-08-04 11:08:02 -05:00 |
| 18724.misc | Refactor Histogram metrics to be homeserver-scoped (#18724) | 2025-07-29 15:35:38 -05:00 |
| 18725.misc | Refactor Gauge metrics to be homeserver-scoped (#18725) | 2025-07-29 10:37:59 -05:00 |
| 18726.bugfix | MSC4306: register the thread subscriptions servlet in the client servlet section (#18726) | 2025-07-24 10:33:34 +00:00 |
| 18727.misc | Bump minimum version bound on Twisted to 21.2.0. (#18727) | 2025-07-24 15:39:54 +01:00 |
| 18728.misc | Use twisted.internet.testing module in tests instead of deprecated twisted.test.proto_helpers. (#18728) | 2025-07-30 12:32:10 +01:00 |
| 18729.misc | Remove some obsolete Twisted version checks. (#18729) | 2025-07-30 12:31:55 +01:00 |
| 18730.misc | Remove obsolete /send_event replication endpoint. (#18730) | 2025-07-30 12:30:40 +01:00 |
| 18733.misc | Update metrics linting to be able to handle custom metrics (#18733) | 2025-08-01 15:34:11 -05:00 |
| 18736.misc | Work around twisted.protocols.amp.TooLong error by reducing logging in some tests. (#18736) | 2025-07-30 12:03:56 +00:00 |
| 18737.removal | Resolve breaking change to run_as_background_process in module API (#18737) | 2025-07-29 14:29:38 -05:00 |
| 18748.misc | Fix cache metrics to collect from all servers (#18748) | 2025-08-01 12:29:58 -05:00 |
| 18750.bugfix | Allow suspended users to be auto-joined to server notice rooms (#18750) | 2025-07-30 15:38:07 +00:00 |
| 18753.misc | Fix Failed to stop metrics warnings in request metrics (#18753) | 2025-07-31 10:31:45 -05:00 |
| 18755.misc | Prevent "Move labelled issues to correct projects" GitHub Actions workflow from failing when an issue is already on the project board (#18755) | 2025-08-05 12:03:25 +01:00 |
| 18756.misc | Update implementation of MSC4306: Thread Subscriptions to include automatic subscription conflict prevention as introduced in later drafts. (#18756) | 2025-08-05 18:22:53 +00:00 |
| 18757.misc | Bump minimum supported rust version to 1.82.0 (#18757) | 2025-08-05 12:02:57 +01:00 |
| 18759.feature | Stabilise MAS integration (#18759) | 2025-08-04 15:48:45 +02:00 |
| 18760.doc.md | Document that there can be multiple workers handling the receipts stream (#18760) | 2025-08-04 13:23:15 +01:00 |
| 18761.doc.md | Improve device lists documentation (#18761) | 2025-08-04 13:19:34 +01:00 |
| 18762.feature | Implement the push rules for experimental MSC4306: Thread Subscriptions. (#18762) | 2025-08-06 15:33:52 +01:00 |
| 18763.bugfix | Add missing await to sleep calls (#18763) | 2025-08-01 16:00:30 +01:00 |
| 18772.misc | Make .sleep(..) return a coroutine (#18772) | 2025-08-05 09:30:52 +01:00 |