On 2025-02-10 at 15:56:59, Maloney, Bryan wrote:
> ### Error
> Kernel logs:
> ```
> NFSv4: state recovery failed for open file pack/tmp_pack_aR0Mu3, error = -13
> ```
> Git clone output:
> ```
> fatal: write error: Bad file descriptor, 137.31 MiB | 45.77 MiB/s
> fatal: fetch-pack: invalid index-pack output
> ```
>
>
> ### Context
>
> The following error is seen when running git clone over NFSv4 and a failover, or server restart, occurs:
> ```
> NFSv4: state recovery failed for open file pack/tmp_pack_aR0Mu3, error = -13
> ```
> This error is an access denied error that happens when you try to open a file with insufficient permissions. In this case the file being opened is a read only file and it is attempted to be opened with write access.
>
> Git opens/creates this file with the O_RDWR flag but then applies read only permissions to it, 0444. Since the permissions are changed after the file is opened, the file handle works fine. However if the file was attempted to be re-opened with that same file handle we would see a -13 error. This is what we see following a failover in NFSv4. When clients reclaim their open files, the NFS server re-evaluates the file access.

Your description of the problem is spot on. We intentionally set the
permissions to 0444 because we never want anyone to change loose object
files or packs, since doing so would corrupt the repository.

This behaviour is specifically allowed by POSIX[0]:

  The argument following the oflag argument does not affect whether the
  file is open for reading, writing, or for both.

POSIX does not allow the re-evaluation of file system access once the
file is open, so it sounds like your file system is not POSIX compliant,
and Git generally requires lots of POSIX-compliant functionality from
the file system.
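
To illustrate the open() behaviour quoted above, here is a minimal
Python sketch. It is not Git's actual code: the function name and the
file name tmp_pack_demo are hypothetical, and it assumes a local
POSIX file system. It creates a file with O_RDWR and mode 0444, writes
through the descriptor it already holds, and then attempts a fresh
O_RDWR open of the same path, which is roughly what a permission
re-check after failover amounts to.

```python
import errno
import os
import stat
import tempfile


def demo_readonly_create(directory):
    """Create a 0444 file opened O_RDWR, write to it, then try to reopen it.

    Returns (bytes_written, permission_bits, reopen_errno), where
    reopen_errno is 0 if the second open succeeded.
    """
    # Hypothetical file name, loosely modelled on Git's temporary packs.
    path = os.path.join(directory, "tmp_pack_demo")

    # The mode argument only sets the permission bits of the new file;
    # the descriptor itself is still open for reading and writing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o444)
    try:
        written = os.write(fd, b"pack data")
        mode = stat.S_IMODE(os.fstat(fd).st_mode)
    finally:
        os.close(fd)

    # A fresh open for writing is checked against the 0444 bits.
    try:
        fd2 = os.open(path, os.O_RDWR)
        os.close(fd2)
        reopen_errno = 0
    except OSError as exc:
        reopen_errno = exc.errno

    return written, mode, reopen_errno


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written, mode, reopen_errno = demo_readonly_create(tmp)
        print("bytes written via original descriptor:", written)
        print("permission bits:", oct(mode))
        print("reopen errno:", errno.errorcode.get(reopen_errno, "none"))
```

As an unprivileged user I would expect the write to succeed and the
reopen to fail with EACCES (the -13 in the kernel log); as root the
reopen succeeds because root bypasses the permission check.
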
For instance, we also require the POSIX consistency guarantees[1],
among myriad others:

  If a read() of file data can be proven (by any means) to occur after a
  write() of the data, it must reflect that write(), even if the calls
  are made by different threads. A similar requirement applies to
  multiple write operations to the same file position.

The implicit violation of that particular requirement is why cloud
syncing services often corrupt the repository.

Could you adjust your NFSv4 server such that it synchronizes state among
the primary and replicas in case of a required failover? I know we have
people successfully using Git with NFS without problems, although this
particular issue does often hit non-POSIX-compliant NFS implementations
in a variety of ways. (This particular variant is new to me, though.)

> This is an issue for active/passive HA file servers. Since NFSv4 evaluates file permissions at the time of opening a file, this FD will always get an access denied error if a failover occurs during git clone.

I'm not sure there's even a good way to solve this problem on the Git
side, since I suspect that if we opened the file as 0644 and then
immediately did an fchmod to 0444, you'd still fail here if the file is
reopened. Is that correct?

I'll also point out that there's a variety of other software that does
the same thing as Git does, including zsh and Emacs, so fixing this in
Git doesn't really fix the entire problem that your NFS server has,
since all of that other software will also be broken in at least some
cases and require similar workarounds. (I discovered this with a
simple, 30-second search on GitHub some time back.)

As far as I'm aware, all other Git implementations also do the same
thing as Git does, so you'd also need to patch go-git, libgit2, and
every other implementation as well.

[0] https://pubs.opengroup.org/onlinepubs/9799919799/functions/open.html
[1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/write.html
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA