Re: v2.47.0-rc1 test failure on cygwin

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Oct 04, 2024 at 01:11:15PM +0200, Johannes Schindelin wrote:
> Hi Patrick,
> > > > The issue on Win32 is different: we cannot commit the "tables.list" lock
> > > > via rename(3P) because the target file may be open for reading by a
> > > > concurrent process. I guess that Cygwin has proper POSIX semantics for
> > > > rename(3P) and thus doesn't hit the same issue.
> > >
> > > Indeed, this is where the symptom lies. I worked around it in Git for
> > > Windows v2.47.0-rc1 via this patch:
> > >
> > > -- snipsnap --
> > > diff --git a/compat/mingw.c b/compat/mingw.c
> > > index c1030e40f227..56db193d1360 100644
> > > --- a/compat/mingw.c
> > > +++ b/compat/mingw.c
> > > @@ -784,6 +784,7 @@ int mingw_open (const char *filename, int oflags, ...)
> > >  	int fd, create = (oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL);
> > >  	wchar_t wfilename[MAX_LONG_PATH];
> > >  	open_fn_t open_fn;
> > > +	int tries = 0;
> > >
> > >  	va_start(args, oflags);
> > >  	mode = va_arg(args, int);
> > > @@ -813,7 +814,11 @@ int mingw_open (const char *filename, int oflags, ...)
> > >  	else if (xutftowcs_long_path(wfilename, filename) < 0)
> > >  		return -1;
> > >
> > > -	fd = open_fn(wfilename, oflags, mode);
> > > +	do {
> > > +		fd = open_fn(wfilename, oflags, mode);
> > > +	} while (fd < 0 && GetLastError() == ERROR_SHARING_VIOLATION &&
> > > +		 retry_ask_yes_no(&tries, "Opening '%s' failed because another "
> > > +			"application accessed it. Try again?", filename));
> > >
> > >  	if ((oflags & O_CREAT) && fd >= 0 && are_wsl_compatible_mode_bits_enabled()) {
> > >  		_mode_t wsl_mode = S_IFREG | (mode&0777);
> >
> > Wait, this is surprising to me as I saw the error happening when calling
> > rename, not open. So why would retries in `open()` fix the symptom? I'm
> > probably missing something.
> 
> I am sorry, I did not really read the bug report in detail, I only meant
> to unblock Git for Windows v2.47.0-rc1 and thought I had a fix in hand. It
> certainly fixed the failures on my local machine, but it unfortunately did
> not fix the problems in CI.

Nothing to be sorry about. Quite on the contrary, thanks for chiming in
here, I was hoping for your input.

> I tried to debug in CI a bit, but it is a gnarly bug to investigate, what
> with plenty of processes the intentionally block each other, and no `gdb`
> to help.

Yup, indeed it is.

> > In any case I also tried something like the below patch (sorry,
> > whitespace-broken).
> 
> Oh, you reminded me that the `mingw_rename()` function looks
> _substantially_ different in Git for Windows than in Git. I did not have
> the time (or the strength of mind, more like) to upstream those changes
> yet.
> 
> > But unfortunately this still caused permission errors when the new path
> > was held open by another process.
> 
> Yes, this will _always_ be a problem, I think. The
> `FILE_RENAME_POSIX_SEMANTICS` as per its documentation should help, but if
> it does not in your tests it might actually not quite work as advertised
> (wouldn't be the first time I encounter such an issue).

It could of course be that the code I wrote is just plain wrong. I had
to stub out the struct as well as the constants because those are not
available on my Windows system, due to whatever reason. Not even when
bumping _WIN32_WINNT to 0x0A00 (W10+). Maybe this is because I'm using
Windows 10, but I thought that thish should have been available starting
with Windows 10 RC1.

> I tried to read through the code (it's a lot!) to figure out whether there
> is potentially any situation when the `tables.list` file is opened but not
> closed immediately, but couldn't find any. Do you know off-hand of any
> such scenario?

We indeed do keep the `tables.list` file descriptor open as part of our
stat cache. But that only happens on non-Windows system. This all
happens in `reftable_stack_reload_maybe_reuse()` and is documented quite
extensively in there.

> > I think for now I'd still lean into the direction of adding the !WINDOWS
> > prerequisite to the test and increasing timeouts such that I can
> > continue to investigate without time pressure.
> 
> Let me bang my head against this problem for a little while longer. You
> might be right, though, that this is a thing we cannot fix in time for
> v2.47.0, which would be sad.

A bit sad, yes, but as I mentioned this is at least not a regression.
The test just demonstrates that the improvements I did in this area are
not yet sufficient for all platforms and that I need to spend some more
time on it. Or that the central "tables.list" mechanism is not a good
fit for Windows, which would be a shame.

I'll send the patch as a reply to this thread so that it can be picked
up as required.

Patrick




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux