Regressed in https://github.com/element-hq/synapse/pull/18900#discussion_r2331554278 (see conversation there for more context) ### How is this a regression? > To give this an update with more hindsight; this logic *was* redundant with the early return and it is safe to remove this complexity ✅ > > It seems like this actually has to do with completed vs incomplete deferreds... > > To explain how things previously worked *without* the early-return shortcut: > > With the normal case of **incomplete awaitable**, we store the `calling_context` and the `f` function is called and runs until it yields to the reactor. Because `f` follows the logcontext rules, it sets the `sentinel` logcontext. Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is `sentinel`) and return. When the deferred completes, we restore `ctx` (which is `sentinel`) before yielding to the reactor again (all good ✅) > > With the other case where we see a **completed awaitable**, we store the `calling_context` and the `f` function is called and runs to completion (no logcontext change). *This is where the shortcut would kick in but I'm going to continue explaining as if we commented out the shortcut.* -- Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is same as the `calling_context`). Because the deferred is already completed, our extra callback is called immediately and we restore `ctx` (which is same as the `calling_context`). Since we never yield to the reactor, the `calling_context` is perfect as that's what we want again (all good ✅) > > --- > > But this also means that our early-return shortcut is no longer just an optimization and is *necessary* to act correctly in the **completed awaitable** case as we want to return with the `calling_context` and not reset to the `sentinel` context. I've updated the comment in https://github.com/element-hq/synapse/pull/18964 to explain the necessity as it's currently just described as an optimization. > > But because we made the same change to `run_coroutine_in_background(...)` which didn't have the same early-return shortcut, we regressed the correct behavior ❌ . This is being fixed in https://github.com/element-hq/synapse/pull/18964 > > > *-- @MadLittleMods, https://github.com/element-hq/synapse/pull/18900#discussion_r2373582917* ### How did we find this problem? Spawning from @wrjlewis [seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io) `Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost` in the logs: <details> <summary>More logs</summary> ``` synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context. 2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost 2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response. 2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already dis connected. 2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/ ``` </details> From the logs there, we can see things relating to `typing.get_new_events` and `/_matrix/client/unstable/org.matrix.simplified_msc3575/sync` which led me to trying out Sliding Sync with the typing extension enabled and allowed me to reproduce the problem locally. Sliding Sync is a unique scenario as it's the only place we use `gather_optional_coroutines(...)` -> `run_coroutine_in_background(...)` (introduced in https://github.com/element-hq/synapse/pull/17884) to exhibit this behavior. ### Testing strategy 1. Configure Synapse to enable [MSC4186](https://github.com/matrix-org/matrix-spec-proposals/pull/4186): Simplified Sliding Sync which is actually under [MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575) ```yaml experimental_features: msc3575_enabled: true ``` 1. Start synapse: `poetry run synapse_homeserver --config-path homeserver.yaml` 1. Make a Sliding Sync request with one of the extensions enabled ```http POST http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync { "lists": {}, "room_subscriptions": { "!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": { "required_state": [], "timeline_limit": 1 } }, "extensions": { "typing": { "enabled": true } } } ``` 1. Open your homeserver logs and notice warnings about `Starting ... from sentinel context: metrics will be lost`
500 lines
17 KiB
Python
500 lines
17 KiB
Python
#
|
|
# This file is licensed under the Affero General Public License (AGPL) version 3.
|
|
#
|
|
# Copyright 2014-2022 The Matrix.org Foundation C.I.C.
|
|
# Copyright (C) 2023 New Vector, Ltd
|
|
#
|
|
# This program is free software: you can redistribute it and/or modify
|
|
# it under the terms of the GNU Affero General Public License as
|
|
# published by the Free Software Foundation, either version 3 of the
|
|
# License, or (at your option) any later version.
|
|
#
|
|
# See the GNU Affero General Public License for more details:
|
|
# <https://www.gnu.org/licenses/agpl-3.0.html>.
|
|
#
|
|
# Originally licensed under the Apache License, Version 2.0:
|
|
# <http://www.apache.org/licenses/LICENSE-2.0>.
|
|
#
|
|
# [This file includes modifications made by New Vector Limited]
|
|
#
|
|
#
|
|
|
|
import logging
|
|
from typing import Callable, Generator, cast
|
|
|
|
from twisted.internet import defer, reactor as _reactor
|
|
|
|
from synapse.logging.context import (
|
|
SENTINEL_CONTEXT,
|
|
LoggingContext,
|
|
PreserveLoggingContext,
|
|
_Sentinel,
|
|
current_context,
|
|
make_deferred_yieldable,
|
|
nested_logging_context,
|
|
run_coroutine_in_background,
|
|
run_in_background,
|
|
)
|
|
from synapse.types import ISynapseReactor
|
|
from synapse.util.clock import Clock
|
|
|
|
from tests import unittest
|
|
from tests.unittest import logcontext_clean
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
reactor = cast(ISynapseReactor, _reactor)
|
|
|
|
|
|
class LoggingContextTestCase(unittest.TestCase):
|
|
def _check_test_key(self, value: str) -> None:
|
|
context = current_context()
|
|
assert isinstance(context, LoggingContext) or isinstance(context, _Sentinel), (
|
|
f"Expected LoggingContext({value}) but saw {context}"
|
|
)
|
|
self.assertEqual(
|
|
str(context), value, f"Expected LoggingContext({value}) but saw {context}"
|
|
)
|
|
|
|
@logcontext_clean
|
|
def test_with_context(self) -> None:
|
|
with LoggingContext("test"):
|
|
self._check_test_key("test")
|
|
|
|
@logcontext_clean
|
|
async def test_sleep(self) -> None:
|
|
"""
|
|
Test `Clock.sleep`
|
|
"""
|
|
clock = Clock(reactor)
|
|
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
async def competing_callback() -> None:
|
|
nonlocal callback_finished
|
|
try:
|
|
# A callback from the reactor should start with the sentinel context. In
|
|
# other words, another task shouldn't have leaked their context to us.
|
|
self._check_test_key("sentinel")
|
|
|
|
with LoggingContext("competing"):
|
|
await clock.sleep(0)
|
|
self._check_test_key("competing")
|
|
|
|
self._check_test_key("sentinel")
|
|
finally:
|
|
# When exceptions happen, we still want to mark the callback as finished
|
|
# so that the test can complete and we see the underlying error.
|
|
callback_finished = True
|
|
|
|
reactor.callLater(0, lambda: defer.ensureDeferred(competing_callback()))
|
|
|
|
with LoggingContext("foo"):
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
@logcontext_clean
|
|
async def test_looping_call(self) -> None:
|
|
"""
|
|
Test `Clock.looping_call`
|
|
"""
|
|
clock = Clock(reactor)
|
|
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
async def competing_callback() -> None:
|
|
nonlocal callback_finished
|
|
try:
|
|
# A `looping_call` callback should have *some* logcontext since we should know
|
|
# which server spawned this loop and which server the logs came from.
|
|
self._check_test_key("looping_call")
|
|
|
|
with LoggingContext("competing"):
|
|
await clock.sleep(0)
|
|
self._check_test_key("competing")
|
|
|
|
self._check_test_key("looping_call")
|
|
finally:
|
|
# When exceptions happen, we still want to mark the callback as finished
|
|
# so that the test can complete and we see the underlying error.
|
|
callback_finished = True
|
|
|
|
with LoggingContext("foo"):
|
|
lc = clock.looping_call(
|
|
lambda: defer.ensureDeferred(competing_callback()), 0
|
|
)
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
# Stop the looping call to prevent "Reactor was unclean" errors
|
|
lc.stop()
|
|
|
|
@logcontext_clean
|
|
async def test_looping_call_now(self) -> None:
|
|
"""
|
|
Test `Clock.looping_call_now`
|
|
"""
|
|
clock = Clock(reactor)
|
|
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
async def competing_callback() -> None:
|
|
nonlocal callback_finished
|
|
try:
|
|
# A `looping_call` callback should have *some* logcontext since we should know
|
|
# which server spawned this loop and which server the logs came from.
|
|
self._check_test_key("looping_call")
|
|
|
|
with LoggingContext("competing"):
|
|
await clock.sleep(0)
|
|
self._check_test_key("competing")
|
|
|
|
self._check_test_key("looping_call")
|
|
finally:
|
|
# When exceptions happen, we still want to mark the callback as finished
|
|
# so that the test can complete and we see the underlying error.
|
|
callback_finished = True
|
|
|
|
with LoggingContext("foo"):
|
|
lc = clock.looping_call_now(
|
|
lambda: defer.ensureDeferred(competing_callback()), 0
|
|
)
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
# Stop the looping call to prevent "Reactor was unclean" errors
|
|
lc.stop()
|
|
|
|
@logcontext_clean
|
|
async def test_call_later(self) -> None:
|
|
"""
|
|
Test `Clock.call_later`
|
|
"""
|
|
clock = Clock(reactor)
|
|
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
async def competing_callback() -> None:
|
|
nonlocal callback_finished
|
|
try:
|
|
# A `call_later` callback should have *some* logcontext since we should know
|
|
# which server spawned this loop and which server the logs came from.
|
|
self._check_test_key("call_later")
|
|
|
|
with LoggingContext("competing"):
|
|
await clock.sleep(0)
|
|
self._check_test_key("competing")
|
|
|
|
self._check_test_key("call_later")
|
|
finally:
|
|
# When exceptions happen, we still want to mark the callback as finished
|
|
# so that the test can complete and we see the underlying error.
|
|
callback_finished = True
|
|
|
|
with LoggingContext("foo"):
|
|
clock.call_later(0, lambda: defer.ensureDeferred(competing_callback()))
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
async def _test_run_in_background(self, function: Callable[[], object]) -> None:
|
|
clock = Clock(reactor)
|
|
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
with LoggingContext("foo"):
|
|
# Fire off the function, but don't wait on it.
|
|
deferred = run_in_background(function)
|
|
self._check_test_key("foo")
|
|
|
|
def callback(result: object) -> object:
|
|
nonlocal callback_finished
|
|
callback_finished = True
|
|
# Pass through the result
|
|
return result
|
|
|
|
# We `addBoth` because when exceptions happen, we still want to mark the
|
|
# callback as finished so that the test can complete and we see the
|
|
# underlying error.
|
|
deferred.addBoth(callback)
|
|
|
|
self._check_test_key("foo")
|
|
|
|
# Now wait for the function under test to have run, and check that
|
|
# the logcontext is left in a sane state.
|
|
while not callback_finished:
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
@logcontext_clean
|
|
async def test_run_in_background_with_blocking_fn(self) -> None:
|
|
async def blocking_function() -> None:
|
|
await Clock(reactor).sleep(0)
|
|
|
|
await self._test_run_in_background(blocking_function)
|
|
|
|
@logcontext_clean
|
|
async def test_run_in_background_with_non_blocking_fn(self) -> None:
|
|
@defer.inlineCallbacks
|
|
def nonblocking_function() -> Generator["defer.Deferred[object]", object, None]:
|
|
with PreserveLoggingContext():
|
|
yield defer.succeed(None)
|
|
|
|
await self._test_run_in_background(nonblocking_function)
|
|
|
|
@logcontext_clean
|
|
async def test_run_in_background_with_chained_deferred(self) -> None:
|
|
# a function which returns a deferred which looks like it has been
|
|
# called, but is actually paused
|
|
def testfunc() -> defer.Deferred:
|
|
return make_deferred_yieldable(_chained_deferred_function())
|
|
|
|
await self._test_run_in_background(testfunc)
|
|
|
|
@logcontext_clean
|
|
async def test_run_in_background_with_coroutine(self) -> None:
|
|
"""
|
|
Test `run_in_background` with a coroutine that yields control back to the
|
|
reactor.
|
|
|
|
This will stress the logic around incomplete deferreds in `run_in_background`.
|
|
"""
|
|
|
|
async def testfunc() -> None:
|
|
self._check_test_key("foo")
|
|
d = defer.ensureDeferred(Clock(reactor).sleep(0))
|
|
self.assertIs(current_context(), SENTINEL_CONTEXT)
|
|
await d
|
|
self._check_test_key("foo")
|
|
|
|
await self._test_run_in_background(testfunc)
|
|
|
|
@logcontext_clean
|
|
async def test_run_in_background_with_nonblocking_coroutine(self) -> None:
|
|
"""
|
|
Test `run_in_background` with a "nonblocking" coroutine (never yields control
|
|
back to the reactor).
|
|
|
|
This will stress the logic around completed deferreds in `run_in_background`.
|
|
"""
|
|
|
|
async def testfunc() -> None:
|
|
self._check_test_key("foo")
|
|
|
|
await self._test_run_in_background(testfunc)
|
|
|
|
@logcontext_clean
|
|
async def test_run_coroutine_in_background(self) -> None:
|
|
"""
|
|
Test `run_coroutine_in_background` with a coroutine that yields control back to the
|
|
reactor.
|
|
|
|
This will stress the logic around incomplete deferreds in `run_coroutine_in_background`.
|
|
"""
|
|
clock = Clock(reactor)
|
|
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
async def competing_callback() -> None:
|
|
nonlocal callback_finished
|
|
try:
|
|
# The callback should have the same logcontext as the caller
|
|
self._check_test_key("foo")
|
|
|
|
with LoggingContext("competing"):
|
|
await clock.sleep(0)
|
|
self._check_test_key("competing")
|
|
|
|
self._check_test_key("foo")
|
|
finally:
|
|
# When exceptions happen, we still want to mark the callback as finished
|
|
# so that the test can complete and we see the underlying error.
|
|
callback_finished = True
|
|
|
|
with LoggingContext("foo"):
|
|
run_coroutine_in_background(competing_callback())
|
|
self._check_test_key("foo")
|
|
await clock.sleep(0)
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
@logcontext_clean
|
|
async def test_run_coroutine_in_background_with_nonblocking_coroutine(self) -> None:
|
|
"""
|
|
Test `run_coroutine_in_background` with a "nonblocking" coroutine (never yields control
|
|
back to the reactor).
|
|
|
|
This will stress the logic around completed deferreds in `run_coroutine_in_background`.
|
|
"""
|
|
# Sanity check that we start in the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
callback_finished = False
|
|
|
|
async def competing_callback() -> None:
|
|
nonlocal callback_finished
|
|
try:
|
|
# The callback should have the same logcontext as the caller
|
|
self._check_test_key("foo")
|
|
|
|
with LoggingContext("competing"):
|
|
# We `await` here but there is nothing to wait for here since the
|
|
# deferred is already complete so we should immediately continue
|
|
# executing in the same context.
|
|
await defer.succeed(None)
|
|
|
|
self._check_test_key("competing")
|
|
|
|
self._check_test_key("foo")
|
|
finally:
|
|
# When exceptions happen, we still want to mark the callback as finished
|
|
# so that the test can complete and we see the underlying error.
|
|
callback_finished = True
|
|
|
|
with LoggingContext("foo"):
|
|
run_coroutine_in_background(competing_callback())
|
|
self._check_test_key("foo")
|
|
|
|
self.assertTrue(
|
|
callback_finished,
|
|
"Callback never finished which means the test probably didn't wait long enough",
|
|
)
|
|
|
|
# Back to the sentinel context
|
|
self._check_test_key("sentinel")
|
|
|
|
@logcontext_clean
|
|
@defer.inlineCallbacks
|
|
def test_make_deferred_yieldable(
|
|
self,
|
|
) -> Generator["defer.Deferred[object]", object, None]:
|
|
# a function which returns an incomplete deferred, but doesn't follow
|
|
# the synapse rules.
|
|
def blocking_function() -> defer.Deferred:
|
|
d: defer.Deferred = defer.Deferred()
|
|
reactor.callLater(0, d.callback, None)
|
|
return d
|
|
|
|
sentinel_context = current_context()
|
|
|
|
with LoggingContext("foo"):
|
|
d1 = make_deferred_yieldable(blocking_function())
|
|
# make sure that the context was reset by make_deferred_yieldable
|
|
self.assertIs(current_context(), sentinel_context)
|
|
|
|
yield d1
|
|
|
|
# now it should be restored
|
|
self._check_test_key("foo")
|
|
|
|
@logcontext_clean
|
|
@defer.inlineCallbacks
|
|
def test_make_deferred_yieldable_with_chained_deferreds(
|
|
self,
|
|
) -> Generator["defer.Deferred[object]", object, None]:
|
|
sentinel_context = current_context()
|
|
|
|
with LoggingContext("foo"):
|
|
d1 = make_deferred_yieldable(_chained_deferred_function())
|
|
# make sure that the context was reset by make_deferred_yieldable
|
|
self.assertIs(current_context(), sentinel_context)
|
|
|
|
yield d1
|
|
|
|
# now it should be restored
|
|
self._check_test_key("foo")
|
|
|
|
@logcontext_clean
|
|
def test_nested_logging_context(self) -> None:
|
|
with LoggingContext("foo"):
|
|
nested_context = nested_logging_context(suffix="bar")
|
|
self.assertEqual(nested_context.name, "foo-bar")
|
|
|
|
|
|
# a function which returns a deferred which has been "called", but
|
|
# which had a function which returned another incomplete deferred on
|
|
# its callback list, so won't yet call any other new callbacks.
|
|
def _chained_deferred_function() -> defer.Deferred:
|
|
d = defer.succeed(None)
|
|
|
|
def cb(res: object) -> defer.Deferred:
|
|
d2: defer.Deferred = defer.Deferred()
|
|
reactor.callLater(0, d2.callback, res)
|
|
return d2
|
|
|
|
d.addCallback(cb)
|
|
return d
|