Question: How to execute git-gc correctly on the git server

ZheNing Hu <adlternative@xxxxxxxxx> · Wed, 7 Dec 2022 23:58:13 +0800

Hi,

I would like to run git gc on my git server periodically, which should help
reduce storage space and optimize the read performance of the repository.
I know GitHub and GitLab both have such a process...

But the concurrency between git gc and other git commands is holding
me back a bit.

The git-gc docs [1] say:

    On the other hand, when git gc runs concurrently with another process,
    there is a risk of it deleting an object that the other process is using but
    hasn’t created a reference to. This may just cause the other process to
    fail or may corrupt the repository if the other process later adds
    a reference to the deleted object.

So it seems that git gc is a dangerous operation: when it runs concurrently
with other git commands, it may corrupt the repository.

Then I read GitHub's blog post [2]. git gc --cruft seems to be used to keep
expiring unreachable objects in a cruft pack, but the post says GitHub uses
a special "limbo" repository to keep the cruft pack for data recovery.
Well, a lot of the details here are pretty hard for me to understand :(

On the other hand, my git server is still at v2.35, and --cruft was
introduced in v2.37, so I'm actually more curious about: how did the server
execute git gc correctly in the past? Do we need a repository level "big lock"
that blocks most/all other git operations? What should the behavior of users'
git clone/push be at this time? Should the server report an error saying that
it is performing git gc, or should the request just wait for git gc to complete?
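
For concreteness, the "big lock" idea could look roughly like the Python
sketch below. It is only an illustration under stated assumptions, not how
GitHub or GitLab do it: the helper names (repo_lock, run_gc) and the lock
file path are hypothetical, the locking uses POSIX flock via Python's fcntl
module, and it only helps if the server wraps every git invocation it spawns
in the same lock, since git itself knows nothing about this file. Normal
operations would take the lock in shared mode and git gc in exclusive mode;
the blocking flag maps to the two behaviors asked about above (wait for gc
to finish, or fail immediately so the server can report an error).

```python
import fcntl
import subprocess
from contextlib import contextmanager


@contextmanager
def repo_lock(lock_path, exclusive, blocking=True):
    """Hold an advisory flock on lock_path (a hypothetical per-repo lock file).

    exclusive=True  -> intended for git gc
    exclusive=False -> intended for clone/fetch/push handlers
    blocking=False  -> raise BlockingIOError instead of waiting
    """
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not blocking:
        mode |= fcntl.LOCK_NB
    # "a" creates the file if needed without truncating it.
    with open(lock_path, "a") as fh:
        fcntl.flock(fh, mode)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def run_gc(repo_dir, lock_path):
    """Run `git gc` in repo_dir while no lock holder is active."""
    with repo_lock(lock_path, exclusive=True):
        subprocess.run(["git", "-C", str(repo_dir), "gc"], check=True)
```

A request handler would then run its git command inside
repo_lock(lock_path, exclusive=False), choosing blocking=True to wait for a
running gc or blocking=False to turn BlockingIOError into a "repository is
under maintenance" error for the user.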

Thanks for any comments and help!

[1]: https://git-scm.com/docs/git-gc
[2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/