On Thu, Dec 8, 2022 at 09:16, Michal Suchánek <msuchanek@xxxxxxx> wrote:
>
> On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Wed, Dec 07 2022, ZheNing Hu wrote:
> >
> > > I would like to run git gc on my git server periodically, which should help
> > > reduce storage space and optimize the read performance of the repository.
> > > I know GitHub and GitLab both have this process...
> > >
> > > But the concurrency between git gc and other git commands is holding
> > > me back a bit.
> > >
> > > The git-gc [1] docs say:
> > >
> > > On the other hand, when git gc runs concurrently with another process,
> > > there is a risk of it deleting an object that the other process is using but
> > > hasn’t created a reference to. This may just cause the other process to
> > > fail or may corrupt the repository if the other process later adds
> > > a reference to the deleted object.
> > >
> > > It seems that git gc is a dangerous operation that may cause data corruption
> > > when run concurrently with other git commands.
> > >
> > > Then I read GitHub's blog post [2]. git gc --cruft seems to be used
> > > to keep those expiring unreachable objects in a cruft pack, but the blog says
> > > GitHub uses a special "limbo" repository to keep the cruft pack for data
> > > recovery. Well, a lot of the details here are pretty hard for me to understand :(
> > >
> > > However, my git server is still at v2.35, and --cruft was
> > > introduced in v2.38, so I'm actually more curious about: how did the server
> > > execute git gc correctly in the past? Do we need a repository-level "big lock"
> > > that blocks most/all other git operations? What should the behavior of users'
> > > git clone/push be at this time? Report an error that the git server is performing
> > > git gc? Or just wait for git gc to complete?
> > >
> > > Thanks for any comments and help!
> > >
> > > [1]: https://git-scm.com/docs/git-gc
> > > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/
> >
> > Is this for a very large hosting site that's anywhere near GitHub,
> > GitLab's etc. scale?
> >
> > A "git gc" on a "live" repo is always racy in theory, but the odds that
> > you'll run into data corrupting trouble tend to approach zero as you
> > increase the gc.pruneExpire setting, with the default 2 weeks being more
> > than enough for even the most paranoid user.
>
> And that two weeks expiration applies to what, exactly?
>
> For commits there is author date and commit date but many other objects
> won't have these I suppose. And the date when the object is pushed into
> the repository is unrelated to these two, anyway.
>
> > So, I think you probably don't need to worry about it. Other major
> > hosting sites do run "git gc" on live repositories, but as always take
> > backups etc.
>
> Actually, it is a real problem. With <100 users and some scripting I got
> unexplained repository corruptions which went away when gc was disabled.
>
> YMMV
>
> Bad locking design is always a landmine waiting to get triggered. If you
> step carefully you might avoid it for some time.
>

I agree with this. What I am hoping for is "no error at all" rather than
"small probability of error".

> Thanks
>
> Michal
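
To make the repository-level "big lock" idea from my original question
concrete, here is a minimal sketch of what I have in mind. It is only an
illustration under assumptions: the lock file path and the function names
(build_gc_command, repo_lock, locked_gc) are hypothetical, and git itself
knows nothing about this lock, so it only helps if every server-side
wrapper that writes to the repository takes the same lock. The
"2.weeks.ago" value mirrors the default gc.pruneExpire mentioned above.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


def build_gc_command(repo_path, prune_expire="2.weeks.ago"):
    """Return the argv for a gc run with an explicit prune grace period."""
    # `git -C <path>` and `git gc --prune=<date>` are existing git options.
    return ["git", "-C", repo_path, "gc", f"--prune={prune_expire}"]


@contextmanager
def repo_lock(lock_path, blocking=True):
    """Hold an exclusive advisory lock on lock_path (a hypothetical lock file)."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # In non-blocking mode this raises BlockingIOError if the lock is held.
        fcntl.flock(fd, flags)
        yield
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)


def locked_gc(repo_path, lock_path, prune_expire="2.weeks.ago"):
    """Run gc while holding the lock; cooperating writers must take it too."""
    with repo_lock(lock_path):
        subprocess.run(build_gc_command(repo_path, prune_expire), check=True)
```

With blocking=True a push wrapper would simply wait for gc to finish; with
blocking=False it could catch BlockingIOError and report that the server is
busy. Either way this is an advisory lock only, so plain git processes that
bypass the wrapper are not protected by it.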