Re: v2.47.0-rc1 test failure on cygwin

On Fri, Oct 04, 2024 at 05:59:30AM +0200, Patrick Steinhardt wrote:
> On Fri, Oct 04, 2024 at 02:02:44AM +0100, Ramsay Jones wrote:
> > Hi Patrick,
> > 
> > Just a quick heads up: t0610-reftable-basics.sh test 47 (ref transaction: many
> > concurrent writers) fails on cygwin. The tail end of the debug output for this
> > test looks like:
> > 
> [snip]
> > 
> > t0610-reftable-basics.sh passed on 'rc0', but this test (and the timeout facility)
> > is new in 'rc1'. I tried simply increasing the timeout (10 fold), but that didn't
> > change the result. (I didn't really expect it to - the 'reftable: transaction
> > prepare: I/O error' does not look timing related!).
> > 
> > Again, just a heads up. (I can't look at it until tomorrow now; any ideas?)
> 
> This failure is kind of known and discussed in [1]. Just to make it
> explicit: this test failure doesn't really surface a regression, the
> reftable code already failed for concurrent writes before. I fixed that
> and added the test that is now flaky, as the fix itself is seemingly
> only sufficient on Linux and macOS.
> 
> I didn't yet have the time to look at whether I can fix it, but should
> finally find the time to do so today.

Hm, interestingly enough I cannot reproduce the issue on Cygwin myself,
but I can reproduce it with MinGW. In fact, the logs you sent all
indicate that we cannot acquire the lock; there is no sign of I/O
errors in them. So I guess you're running into timeout issues. Does the
following patch fix this for you?

diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
index 2d951c8ceb..b5cad805d4 100755
--- a/t/t0610-reftable-basics.sh
+++ b/t/t0610-reftable-basics.sh
@@ -455,10 +455,7 @@ test_expect_success 'ref transaction: many concurrent writers' '
 	git init repo &&
 	(
 		cd repo &&
-		# Set a high timeout such that a busy CI machine will not abort
-		# early. 10 seconds should hopefully be ample of time to make
-		# this non-flaky.
-		git config set reftable.lockTimeout 10000 &&
+		git config set reftable.lockTimeout -1 &&
 		test_commit --no-tag initial &&
 
 		head=$(git rev-parse HEAD) &&

The issue on Win32 is different: we cannot commit the "tables.list" lock
via rename(3P) because the target file may be open for reading by a
concurrent process. I guess that Cygwin has proper POSIX semantics for
rename(3P) and thus doesn't hit the same issue.

We already try to emulate POSIX semantics somewhat in `mingw_rename()`
by using a retry-loop when we hit `ERROR_ACCESS_DENIED`, which is what
we get when the target file is open in another process. But that
seemingly isn't enough when there is a lot of contention around a file.
So I'm currently investigating whether we can adopt something similar to
what Cygwin is doing for Win32, as well. I assume that they use
`FILE_RENAME_INFORMATION_EX` with `FILE_RENAME_POSIX_SEMANTICS`, which
should give us what we're looking for.
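
To illustrate the retry approach described above, here is a rough,
portable sketch of a retry-with-backoff loop around a rename operation.
This is not the actual `mingw_rename()` code: the names
`rename_with_retry()`, `fake_rename()` and `fake_failures_left` are
hypothetical, the retry count and delays are arbitrary, and `EACCES`
merely stands in for `ERROR_ACCESS_DENIED`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature of the operation being retried. */
typedef int (*rename_fn)(const char *from, const char *to);

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	nanosleep(&ts, NULL);
}

/*
 * Call `op` up to `max_tries` times. Retry only when it fails with
 * EACCES (standing in for "target is open in another process"),
 * doubling the delay between attempts. Returns 0 on success and -1
 * when the tries are exhausted or a different error occurs.
 */
static int rename_with_retry(rename_fn op, const char *from,
			     const char *to, int max_tries, long delay_ms)
{
	int i;

	assert(max_tries > 0);
	for (i = 0; i < max_tries; i++) {
		if (!op(from, to))
			return 0;
		if (errno != EACCES)
			return -1;
		if (i + 1 < max_tries)
			sleep_ms(delay_ms);
		delay_ms *= 2;
	}
	return -1;
}

/*
 * Simulated rename for demonstration purposes only: fails with EACCES
 * while `fake_failures_left` is positive, then succeeds. It does not
 * touch the file system.
 */
static int fake_failures_left;

static int fake_rename(const char *from, const char *to)
{
	(void)from;
	(void)to;
	if (fake_failures_left > 0) {
		fake_failures_left--;
		errno = EACCES;
		return -1;
	}
	return 0;
}
```

With a bounded number of tries, a target that stays busy for longer
than the total backoff still makes the rename fail, which is the
situation a heavily contended file can produce.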

Ugh, well. It turns out the implementation of rename(3P) in Cygwin is
500 lines long, so I guess this is a non-trivial problem :) Then again,
they of course have to handle a whole lot more cases than we do. My
guess was correct, though: they do use `FILE_RENAME_POSIX_SEMANTICS`.
The catch is that this flag only exists in Windows 10 and newer, but
that should be a fine compromise.

I'll try to wrap my head around how all of this works.

Patrick



