[PATCH] t0610: work around flaky test with concurrent writers

Patrick Steinhardt <ps@xxxxxx> · Fri, 4 Oct 2024 14:16:45 +0200

In 6241ce2170 (refs/reftable: reload locked stack when preparing
transaction, 2024-09-24) we have introduced a new test that exercises
how the reftable backend behaves with many concurrent writers all racing
with each other. This test was introduced after a couple of fixes in
this context that should make concurrent writes behave gracefully. As it
turns out though, Windows systems do not yet handle concurrent writes
properly, as we've got two reports for Cygwin and MinGW failing in this
newly added test.

The root cause of this is how we update the "tables.list" file: when
writing a new stack of tables we first write the data into a lockfile
and then rename that file into place. But Windows forbids us from doing
that rename when the target path is open for reading by another process.
And as the test races both readers and writers with each other we are
quite likely to hit this edge case.
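
As a rough illustration, the update pattern described above can be
sketched in shell. This is not the reftable code, which is written in C:
the helper name update_tables_list and the table names used with it are
hypothetical and only demonstrate the lockfile-then-rename sequence.

```shell
# Hypothetical helper: replace "tables.list" in directory $1 with the
# remaining arguments, one table name per line.
update_tables_list () {
	dir=$1 &&
	shift &&
	# Creating the lockfile with noclobber set fails if another
	# writer already holds the lock.
	(set -C && : >"$dir/tables.list.lock") 2>/dev/null || return 1
	printf "%s\n" "$@" >"$dir/tables.list.lock" &&
	# On POSIX systems the rename replaces the target atomically,
	# even while readers have it open. This is the step that
	# Windows refuses when "tables.list" is open elsewhere.
	mv "$dir/tables.list.lock" "$dir/tables.list"
}
```

A writer would call this as `update_tables_list .git/reftable a.ref b.ref`,
where the directory and table names are again just assumptions.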

Now the two reports are somewhat different from one another:

  - On Cygwin we hit timeouts because we fail to lock the "tables.list"
    file within 10 seconds. The renames themselves succeed because
    Cygwin provides extensive compatibility logic to make them work
    even when the target file is open already.

  - On MinGW we hit I/O errors on rename. While we do have some retry
    logic in place to make the rename work in some cases, this is
    seemingly not sufficient when there is this much contention around
    the files.
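
To show the general idea of retrying a rename, here is a minimal shell
sketch. It does not reproduce the actual retry logic, which lives in C.
The helper name rename_with_retry, the default of five attempts and the
one-second delay are all assumptions made for illustration.

```shell
# Hypothetical helper: rename $1 to $2, giving up after $3 attempts
# (default 5) and waiting one second between attempts.
rename_with_retry () {
	src=$1 dst=$2 tries=${3:-5}
	while :
	do
		mv -f "$src" "$dst" 2>/dev/null && return 0
		tries=$((tries - 1))
		# Stop once all attempts are used up instead of sleeping
		# one more time for nothing.
		test "$tries" -gt 0 || return 1
		sleep 1
	done
}
```

With many writers and readers contending for the same file, a bounded
retry loop like this can still run out of attempts, which matches the
failure mode seen in the test.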

Neither of these cases is a regression: the logic didn't work before the
mentioned commit, and after the commit it performs well on Linux, macOS
and Cygwin, and at least a bit better with MinGW. But the tests show
that we need to put more thought into how to make this work properly on
MinGW systems.

The fact that Cygwin can work around this issue with better emulation of
POSIX-style atomic renames shows that we can in theory make MinGW work
better, as well. But doing so likely requires quite some fiddling with
Windows internals, and Git v2.47 is about to be released in a couple of
days. This makes any potential fix quite risky as it would have to
happen deep down in our rename(3P) implementation in "compat/mingw.c".

Let's instead work around both issues by disabling the test on MinGW
and by significantly increasing the locking timeout for Cygwin. This
bumped timeout also helps when running with e.g. the address and memory
sanitizers, which also tend to significantly extend the runtime of this
test.

This should be revisited after Git v2.47 is out.

Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
---

This fix can be applied to remove some of the stress with the Git v2.47
release pending. It would of course be preferable to find an alternate
fix that makes MinGW work as required, but if you take the 500 lines of
code that make up the rename(3P) implementation of Cygwin as a hint you
quickly figure out that this is a rather complex problem.

Patrick

 t/t0610-reftable-basics.sh | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..86a746aff0 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,15 +450,27 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
-test_expect_success 'ref transaction: many concurrent writers' '
+# This test fails most of the time on Windows systems. The root cause is
+# that Windows does not allow us to rename the "tables.list.lock" file into
+# place when "tables.list" is open for reading by a concurrent process.
+#
+# The same issue does not happen on Cygwin because its implementation of
+# rename(3P) is emulating POSIX-style renames, including renames over files
+# that are open.
+test_expect_success !MINGW 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		# Set a high timeout. While a couple of seconds should be
+		# plenty, using the address sanitizer will significantly slow
+		# us down here. Furthermore, Cygwin is also way slower due to
+		# the POSIX-style rename emulation. So we are aiming way higher
+		# than you would ever think is necessary just to keep us from
+		# flaking. We could also lock indefinitely by passing -1, but
+		# that could potentially block CI jobs indefinitely if there
+		# was a bug here.
+		git config set reftable.lockTimeout 300000 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&

base-commit: 111e864d69c84284441b083966c2065c2e9a4e78
-- 
2.47.0.rc0.dirty