Re: [PATCH] refs: work around network caching on Windows

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Fri, 15 Jul 2022 10:30:04 +0200

On Fri, Jul 15 2022, Johannes Schindelin via GitGitGadget wrote:

> From: Pierre Garnier <pgarnier@xxxxxxxx>
>
> Network shares sometimes use aggressive caching, in which case a
> just-created directory might not be immediately visible to Git.
>
> One symptom of this scenario is the following error:
>
> 	$ git tag -a -m "automatic tag creation"  test_dir/test_tag
> 	fatal: cannot lock ref 'refs/tags/test_dir/test_tag': unable to resolve reference 'refs/tags/test_dir/test_tag': Not a directory
>
> Note: This does not necessarily happen in all Windows setups. One setup
> where it _did_ happen is a Windows Server 2019 VM, and as hinted in
>
> 	http://woshub.com/slow-network-shared-folder-refresh-windows-server/
>
> the following commands worked around it:
>
> 	Set-SmbClientConfiguration -DirectoryCacheLifetime 0
> 	Set-SmbClientConfiguration -FileInfoCacheLifetime 0
> 	Set-SmbClientConfiguration -FileNotFoundCacheLifetime 0
>
> This would impact performance negatively, though, as it essentially
> turns off all caching, therefore we do not want to require users to do
> that just to be able to use Git on Windows.
>
> The underlying culprit is that `GetFileAttributesExW()` that is called from
> `mingw_lstat()` can raise an error `ERROR_PATH_NOT_FOUND`, which is
> translated to `ENOTDIR`, as opposed to `ENOENT` as expected on Linux.
>
> Therefore, when trying to read a ref, let's allow `ENOTDIR` in addition
> to `ENOENT` to indicate that said ref is missing.
>
> This fixes https://github.com/git-for-windows/git/issues/3727

This really has much wider implications, as we hard depend on POSIX
semantics in various other places. E.g. the SHA-1 collision detection
sanity check (not sha1dc, the "does it exist?" one) would be racy on
such a system, wouldn't it?

>  refs/files-backend.c  | 2 +-
>  refs/packed-backend.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 8db7882aacb..b2a880f62f0 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -381,7 +381,7 @@ stat_ref:
>  	if (lstat(path, &st) < 0) {
>  		int ignore_errno;
>  		myerr = errno;
> -		if (myerr != ENOENT || skip_packed_refs)
> +		if ((myerr != ENOENT && myerr != ENOTDIR) || skip_packed_refs)
>  			goto out;
>  		if (refs_read_raw_ref(refs->packed_ref_store, refname, oid,
>  				      referent, type, &ignore_errno)) {
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index 97b68377673..23d478627a7 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -480,7 +480,7 @@ static int load_contents(struct snapshot *snapshot)
>  
>  	fd = open(snapshot->refs->path, O_RDONLY);
>  	if (fd < 0) {
> -		if (errno == ENOENT) {
> +		if (errno == ENOENT || errno == ENOTDIR) {
>  			/*
>  			 * This is OK; it just means that no
>  			 * "packed-refs" file has been written yet,
>
> base-commit: bbea4dcf42b28eb7ce64a6306cdde875ae5d09ca

So I'm skeptical that this can work at all, but in any case I really
think we should wrap this non-POSIX hack in an #ifdef for the relevant
platform here, or "#ifdef NON_POSIX_FS_HACK" or something.
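
A minimal sketch of what such a guard could look like. Both
NON_POSIX_FS_HACK and errno_means_missing() are made-up names for
illustration, not existing Git build knobs or helpers; the two call
sites in the patch would then test errno_means_missing(myerr) instead
of comparing against ENOENT directly:

```c
#include <errno.h>

/*
 * Hypothetical helper, not part of Git. NON_POSIX_FS_HACK is an assumed
 * build-time knob that only the affected platform would define.
 */
#ifdef NON_POSIX_FS_HACK
/* Aggressively cached shares may report ENOTDIR for a missing path. */
static inline int errno_means_missing(int err)
{
	return err == ENOENT || err == ENOTDIR;
}
#else
/* On a well-behaved FS only ENOENT means "no such ref". */
static inline int errno_means_missing(int err)
{
	return err == ENOENT;
}
#endif
```

That keeps the POSIX code path byte-for-byte the same, and a reader can
see at a glance that the ENOTDIR case only exists for the hack.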

You don't want to be carefully reviewing this code thinking wtf, only to
discover later that it's impossible on a well-behaved FS.

Also, NFS has similar options (which I've seen hard break git repos &
corrupt them in the past). How do its various dangerous caching options
behave in these scenarios?

IOW if we're supporting non-POSIX behavior on platform A, are we
inadvertently making the non-POSIX behavior on platform B even worse?
Even more of a reason to wrap it in ifdefs...

But I really think the answer to this is similar to brian's FAQ patches
for git repos on "cloud mounts", i.e. document carefully that it's
likely to corrupt your repo in unexpected ways.

So I'd be much more comfortable with a workaround that stole what we do
for the *.lock spinning here, i.e. we'd detect this errno, say "wtf,
non-POSIX?" then spin for N ms, and hope to get "past the race".

That would be guaranteed not to suffer from odd corruption issues (as
the behavior wouldn't change, we'd just wait and hope to "catch up").

Wouldn't that be narrower & better here?
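
A rough sketch of that retry idea, assuming a POSIX-ish lstat() and
nanosleep(). The helper name lstat_retry_enotdir() and the timeout
parameter are hypothetical, and this does not reproduce Git's actual
*.lock retry code; it only shows the "see ENOTDIR, back off, try again"
shape:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper, not part of Git: retry lstat() for up to
 * timeout_ms while it keeps failing with ENOTDIR. Any other result is
 * returned at once, so callers still see plain ENOENT semantics.
 */
static int lstat_retry_enotdir(const char *path, struct stat *st,
			       int timeout_ms)
{
	int waited_ms = 0;
	int delay_ms = 1;

	for (;;) {
		struct timespec ts;

		if (!lstat(path, st))
			return 0;
		if (errno != ENOTDIR || waited_ms >= timeout_ms)
			return -1; /* errno is from the last lstat() */

		/* Never sleep past the overall timeout. */
		if (delay_ms > timeout_ms - waited_ms)
			delay_ms = timeout_ms - waited_ms;
		ts.tv_sec = delay_ms / 1000;
		ts.tv_nsec = (delay_ms % 1000) * 1000000L;
		nanosleep(&ts, NULL);

		waited_ms += delay_ms;
		delay_ms *= 2; /* exponential backoff */
	}
}
```

If ENOTDIR persists past the timeout the caller gets the same error as
today, so the worst case is a short delay, not a changed outcome.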