Re: Avoid race condition between fetch and repack/gc?

On 3/16/2020 4:23 AM, Andreas Krey wrote:
> Hi all,
> 
> we are occasionally seeing things like this:
> 
> | DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune

I'm happy to see these options. I hope they are helping you!

> | Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts.
> | remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed        
This _could_ mean a lot of things, but....

> | error: git upload-pack: git-pack-objects died with error.
> | fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
> | remote: aborting due to possible repository corruption on the remote side.
> | fatal: protocol error: bad pack header
> 
> and when you look in the server repository there is a new packfile dated just around
> that time. It looks like the fetch tries to access a packfile that it assumes to exist,
> but the GC on the server throws it away just in that moment, and thus upload-pack fails.

...your intuition about repacking seems accurate. The important part of the
race condition is likely that the server process opened and still holds a
read handle on the .idx file, but when it looks for the object contents it
tries to open the .pack file, which has already been deleted.

This error is emitted by use_pack() in packfile.c. I'm surprised there is no
fallback here, and we simply die().

The race condition seems to come from the loop in do_oid_object_info_extended()
in sha1-file.c, which iterates through the packs until it finds the object in
question: it does not verify that the .pack file is open with a valid handle
before terminating the loop and calling packed_object_info().

Perhaps the fix is to update do_oid_object_info_extended() to "accept" a pack as
the home of the object only after verifying the pack is either open or can be
opened. That seems like the least-invasive fix to me.

The more-invasive fix is to modify the stack from packed_object_info() to
use_pack() to use error messages and return codes instead of die(). This would
still need to affect the loop in do_oid_object_info_extended(), but may be a
better way to handle this situation in general.

Of course, this is a very critical code path, and maybe other community members
have more context as to why we are not already doing this?

> Is there a way to avoid this?
> 
> Should there be, like git repack waiting a bit before deleting old packfiles?

This all depends on how you are managing your server. It is likely that you
could create your own maintenance strategy that handles this for you.

The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
of issue, but is not yet integrated well with reachability bitmaps. You likely
require the bitmaps to keep your server performance, so that may not be a way
forward for you.

Thanks,
-Stolee


