Re: Question: How to execute git-gc correctly on the git server

Michal Suchánek <msuchanek@xxxxxxx> wrote on Thu, Dec 8, 2022 at 09:16:
>
> On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Wed, Dec 07 2022, ZheNing Hu wrote:
> >
> > > I would like to run git gc on my git server periodically, which should help
> > > reduce storage space and optimize the read performance of the repository.
> > > I know GitHub and GitLab both have such a process...
> > >
> > > But the concurrency between git gc and other git commands is holding
> > > me back a bit.
> > >
> > > git-gc [1] docs say:
> > >
> > >     On the other hand, when git gc runs concurrently with another process,
> > >     there is a risk of it deleting an object that the other process is using but
> > >     hasn’t created a reference to. This may just cause the other process to
> > >     fail or may corrupt the repository if the other process later adds a
> > >     reference to the deleted object.
> > >
> > > It seems that git gc is a dangerous operation that may corrupt data when
> > > it runs concurrently with other git commands.
> > >
> > > Then I read GitHub's blog post [2]. git gc --cruft seems to be used
> > > to keep expiring unreachable objects in a cruft pack, but the blog says
> > > GitHub uses a special "limbo" repository to keep the cruft pack for git data
> > > recovery. Well, a lot of the details here are pretty hard for me to understand :(
> > >
> > > However, on the other hand, my git server is still at v2.35, and --cruft was
> > > introduced in v2.38, so I'm actually more curious about: how did the server
> > > execute git gc correctly in the past? Do we need a repository level "big lock"
> > > that blocks most/all other git operations? What should the behavior of users'
> > > git clone/push be at this time? Report error that the git server is performing
> > > git gc? Or just wait for git gc to complete?
> > >
> > > Thanks for any comments and help!
> > >
> > > [1]: https://git-scm.com/docs/git-gc
> > > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/
> >
> > Is this for a very large hosting site that's anywhere near the scale of
> > GitHub, GitLab, etc.?
> >
> > A "git gc" on a "live" repo is always racy in theory, but the odds that
> > you'll run into data-corrupting trouble tend to approach zero as you
> > increase the gc.pruneExpire setting, with the default 2 weeks being more
> > than enough for even the most paranoid user.
>
> And that two weeks expiration applies to what, exactly?
>
> For commits there is author date and commit date but many other objects
> won't have these I suppose. And the date when the object is pushed into
> the repository is unrelated to these two, anyway.
>
> > So, I think you probably don't need to worry about it. Other major
> > hosting sites do run "git gc" on live repositories, but as always take
> > backups etc.
>
> Actually, it is a real problem. With <100 users and some scripting I got
> unexplained repository corruptions which went away when gc was disabled.
>
> YMMV
>
> Bad locking design is always a landmine waiting to get triggered. If you
> step carefully you might avoid it for some time.
>

I agree with this. What I am really hoping for is "no error at all"
rather than "a small probability of error".

> Thanks
>
> Michal
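
On the question above about what the two-week expiration applies to: as
far as I understand it, for loose objects the age is taken from the
modification time of the object file on disk, not from the author or
commit date stored inside a commit. Below is a minimal Python sketch
that only illustrates that age check. It is not how git implements
pruning: it ignores reachability entirely, and it does not look at
packed objects or cruft packs. The names stale_loose_objects and
TWO_WEEKS are hypothetical, made up for this example.

```python
import os
import re
import time

# Assumption for illustration: mirrors the 2-week default mentioned above.
TWO_WEEKS = 14 * 24 * 60 * 60


def stale_loose_objects(objects_dir, max_age_seconds=TWO_WEEKS, now=None):
    """Return [(object_id, age_seconds)] for loose object files whose
    mtime is older than the cutoff. Reachability is NOT checked."""
    now = time.time() if now is None else now
    stale = []
    for fanout in sorted(os.listdir(objects_dir)):
        # Loose objects live in two-hex-digit directories, e.g. objects/ab/.
        # This skips objects/pack and objects/info.
        if not re.fullmatch(r"[0-9a-f]{2}", fanout):
            continue
        fanout_path = os.path.join(objects_dir, fanout)
        if not os.path.isdir(fanout_path):
            continue
        for name in sorted(os.listdir(fanout_path)):
            # Remaining 38 hex digits for SHA-1, 62 for SHA-256.
            if not re.fullmatch(r"[0-9a-f]{38}|[0-9a-f]{62}", name):
                continue
            path = os.path.join(fanout_path, name)
            age = now - os.stat(path).st_mtime
            if age > max_age_seconds:
                stale.append((fanout + name, age))
    return stale
```

Calling stale_loose_objects("/path/to/repo.git/objects") on a copy of a
repository lists the loose objects that are old enough to be candidates.
Whether git would actually delete them still depends on them being
unreachable.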
