In 6241ce2170 (refs/reftable: reload locked stack when preparing
transaction, 2024-09-24) we have introduced a new test that exercises
how the reftable backend behaves with many concurrent writers all racing
with each other. This test was introduced after a couple of fixes in
this context that should make concurrent writes behave gracefully. As it
turns out though, Windows systems do not yet handle concurrent writes
properly, as we've got two reports for Cygwin and MinGW failing in this
newly added test.

The root cause of this is how we update the "tables.list" file: when
writing a new stack of tables we first write the data into a lockfile
and then rename that file into place. But Windows forbids us from doing
that rename when the target path is open for reading by another process.
And as the test races both readers and writers with each other we are
quite likely to hit this edge case.

Now the two reports are somewhat different from one another:

  - On Cygwin we hit timeouts because we fail to lock the "tables.list"
    file within 10 seconds. The renames themselves succeed even when the
    target file is open because Cygwin provides extensive compatibility
    logic to make them work even when the target file is open already.

  - On MinGW we hit I/O errors on rename. While we do have some retry
    logic in place to make the rename work in some cases, this is
    seemingly not sufficient when there is this much contention around
    the files.

Neither of these cases is a regression: the logic didn't work before the
mentioned commit, and after the commit it performs well on Linux, macOS
and in Cygwin, and at least a bit better with MinGW. But the tests show
that we need to put more thought into how to make this work properly on
MinGW systems.

The fact that Cygwin can work around this issue with better emulation of
POSIX-style atomic renames shows that we can in theory make MinGW work
better, as well.
But doing so likely requires quite some fiddling with Windows internals,
and Git v2.47 is about to be released in a couple days. This makes any
potential fix quite risky as it would have to happen deep down in our
rename(3P) implementation in "compat/mingw.c". Let's instead work around
both issues by disabling the test on MinGW and by significantly
increasing the locking timeout for Cygwin.

This bumped timeout also helps when running with e.g. the address and
memory sanitizers, which also tend to significantly extend the runtime
of this test.

This should be revisited after Git v2.47 is out.

Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
---

This fix can be applied to remove some of the stress with the Git v2.47
release pending. It would of course be preferable to find an alternate
fix that makes MinGW work as required, but if you take the 500 lines of
code that is the rename(3P) implementation of Cygwin as a hint you
quickly figure out that this is a rather complex problem.

Patrick

 t/t0610-reftable-basics.sh | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..86a746aff0 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -450,15 +450,27 @@ test_expect_success 'ref transaction: retry acquiring tables.list lock' '
 	)
 '
 
-test_expect_success 'ref transaction: many concurrent writers' '
+# This test fails most of the time on Windows systems. The root cause is
+# that Windows does not allow us to rename the "tables.list.lock" file into
+# place when "tables.list" is open for reading by a concurrent process.
+#
+# The same issue does not happen on Cygwin because its implementation of
+# rename(3P) is emulating POSIX-style renames, including renames over files
+# that are open.
+test_expect_success !MINGW 'ref transaction: many concurrent writers' '
 	test_when_finished "rm -rf repo" &&
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		# Set a high timeout. While a couple of seconds should be
+		# plenty, using the address sanitizer will significantly slow
+		# us down here. Furthermore, Cygwin is also way slower due to
+		# the POSIX-style rename emulation. So we are aiming way higher
+		# than you would ever think is necessary just to keep us from
+		# flaking. We could also lock indefinitely by passing -1, but
+		# that could potentially block CI jobs indefinitely if there
+		# was a bug here.
+		git config set reftable.lockTimeout 300000 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&

base-commit: 111e864d69c84284441b083966c2065c2e9a4e78
-- 
2.47.0.rc0.dirty
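
For readers unfamiliar with the update scheme described in the commit
message, here is a minimal shell sketch of the "write into a lockfile,
then rename into place" pattern. It is an illustration only and not Git's
actual implementation, which lives in C. The function name
update_tables_list and the table names are hypothetical, and using the
shell's noclobber option to approximate exclusive lock creation is an
assumption made for this sketch.

```shell
# Replace "$1/tables.list" with the remaining arguments, one per line.
# Returns non-zero if the lock file already exists (another writer is active).
update_tables_list () {
	dir=$1
	shift
	lock="$dir/tables.list.lock"

	# With noclobber (set -C), ">" refuses to overwrite an existing file,
	# which roughly mimics creating the lock exclusively.
	if ! (set -C && : >"$lock") 2>/dev/null
	then
		echo "could not lock $dir/tables.list" >&2
		return 1
	fi

	# Write the new contents into the lock file, then rename it over the
	# target. On POSIX systems this rename replaces the target even while
	# readers have it open; per the commit message, this is the step that
	# fails on MinGW.
	printf '%s\n' "$@" >"$lock" &&
	mv "$lock" "$dir/tables.list"
}

# Usage with hypothetical table names:
repo=$(mktemp -d)
update_tables_list "$repo" first.ref second.ref
cat "$repo/tables.list"
```

A writer that finds the lock already taken gets a non-zero return and
would have to retry until its timeout expires, which is the kind of wait
that reftable.lockTimeout bounds in the test above.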