Re: [PATCH 0/2] commit-graph: suggest deleting corrupt graphs

Josh Steadmon <steadmon@xxxxxxxxxx> · Wed, 24 Apr 2024 12:30:05 -0700

On 2024.02.22 16:05, Junio C Hamano wrote:
> Josh Steadmon <steadmon@xxxxxxxxxx> writes:
> 
> > At $WORK, we've had a few occasions where someone's commit-graph becomes
> > corrupt, and hits various BUG()s that block their day-to-day work. When
> > this happens, we advise the user to either disable the commit graph, or
> > to delete it and let it be regenerated.
> >
> > It would be a nicer user experience if we can make this a self-serve
> > procedure. To do this, let's add a new `git commit-graph clear`
> > subcommand so that users don't need to manually delete files under their
> > .git directories. And to make it self-documenting, update various BUG(),
> > die(), and error() messages to suggest removing the commit graph to
> > recover from the corruption.
> 
> I am of two minds.
> 
> For one, if we know there is a corruption and if we know that we
> will certainly recover cleanly if we removed these files, it would
> be fair for an end-user to respond with: instead of telling me to
> run "commit-graph clear", you can run it for me, can't you?
> 
> The other one is if it hinders debugging the root cause to run
> "clear", whether it is done by the end-user or by the mechanism that
> detects and dies upon discovery of a corruption.  Do we know how
> these commit-graph files become corrupt?  How valuable would these
> corrupt files be to help us track down where the corruption comes
> from?  If they are not all that useful in debugging, then removing
> them ourselves or telling users to remove them may be OK, of course.
> 
> Do these BUG()s come from corruption that can be diagnosed upfront
> when we "open" the commit-graph files?  I am wondering if it would
> be the matter of teaching prepare_commit_graph() to check for
> corruption and return without enabling the support.
> 
> Thanks.

Sorry for the late reply; this got buried in my inbox. The corruption we
saw was related to a generation-number bug [1] that I think was only
present for a short while in 'next'.

[1] https://lore.kernel.org/git/YBn3fxFe978Up5Ly@xxxxxxxxxx/

I believe that being able to examine the files after the corruption was
detected did help us narrow down the issue, so I would lean towards not
automatically deleting them upon detecting corruption.

I don't think this case could be caught when the commit-graph file is
opened (e.g. in prepare_commit_graph()) without running a full
`git commit-graph verify` up front.
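
For what it's worth, a manual recovery that keeps the corrupt files around
for debugging could look roughly like the sketch below. This is only an
illustration, not what the series implements: preserve_commit_graph is a
hypothetical helper name, and it assumes the default layout where the graph
lives at objects/info/commit-graph (single file) or objects/info/commit-graphs/
(split chain) under the git dir, with no alternates or GIT_OBJECT_DIRECTORY
in play, and a backup directory that does not already hold a previous copy.

```shell
# Hypothetical helper: move commit-graph data aside instead of deleting it,
# so the corrupt files stay available for later inspection.
# Usage: preserve_commit_graph <git-dir> <backup-dir>
preserve_commit_graph() {
    git_dir=$1
    backup_dir=$2
    # Assumption: objects live in the default location under the git dir.
    info_dir="$git_dir/objects/info"

    mkdir -p "$backup_dir" || return 1

    # Single-file commit graph.
    if [ -f "$info_dir/commit-graph" ]; then
        mv "$info_dir/commit-graph" "$backup_dir/" || return 1
    fi

    # Split (incremental) commit-graph chain directory.
    if [ -d "$info_dir/commit-graphs" ]; then
        mv "$info_dir/commit-graphs" "$backup_dir/" || return 1
    fi
}
```

After something like
`preserve_commit_graph "$(git rev-parse --git-dir)" /tmp/corrupt-commit-graph`,
running `git commit-graph write --reachable` followed by
`git commit-graph verify` should regenerate and check the graph, while the
moved-aside files remain available for anyone trying to track down the root
cause.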