On Fri, Sep 08, 2017 at 11:40:21AM +0900, Junio C Hamano wrote:

> Our "git fsck" already treats certain brokenness (like a tree whose
> entry has a mode that is 0-padded to the left) as broken but still
> tolerates it. I am not sure if it is sufficient to diagnose and
> declare broken and invalid when we see sha3-content that records
> these "mechanically derivable but expensive to compute" pieces of
> information incorrectly.
>
> I am leaning towards saying "yes, catching in fsck is enough" and
> suggesting to add generation number to sha3-content of the commit
> objects, and to add even the "original sha1 name" thing if we find
> a good use for it. But I cannot shake off this nagging feeling that
> I am missing some huge problem in adding these fields and opening
> ourselves to more classes of broken objects.

I share your nagging feeling. I have two thoughts on the "fsck can
catch it" line of reasoning.

  1. It's harder to fsck generation numbers than other syntactic
     elements of an object, because the check inherently depends on
     the links. So I can't fsck a commit object in isolation; I have
     to open its parents and check _their_ generation numbers.

     In some sense that isn't a big deal. A real fsck wants to know
     that we _have_ the parents in the first place. But traditionally
     we've separated "is this syntactically valid?" from "do we have
     full connectivity?", and features like shallow clones rely on us
     fudging the latter but not the former. A shallow history could
     never properly fsck its generation numbers.

     A multiple-hash field doesn't have this problem: it's purely a
     function of the bytes in the object.

  2. I wouldn't classify the current fsck checks as a wild success in
     containing breakages. If a buggy implementation produces invalid
     objects, the same buggy implementation generally lets people (and
     their colleagues) unwittingly build on top of those objects. It's
     only later (sometimes much later) that they interact with a
     non-buggy implementation whose fsck complains.
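Point 1 can be made concrete with a toy sketch (the names here are
hypothetical stand-ins, not Git's actual fsck code): validating a
generation number forces us to load every parent, whereas a purely
syntactic check never has to leave the object being checked.

```python
from collections import namedtuple

# Hypothetical in-memory stand-in for a parsed commit object.
Commit = namedtuple("Commit", ["oid", "parents", "generation"])

def fsck_generation(commit, load_commit):
    """Return True iff commit's generation is consistent with its parents.

    load_commit(oid) fetches and parses a parent commit -- exactly the
    step a syntax-only check (or a shallow clone, which lacks the
    parents entirely) cannot perform.
    """
    if not commit.parents:                      # root commit
        return commit.generation == 1
    parent_gens = [load_commit(p).generation for p in commit.parents]
    return commit.generation == max(parent_gens) + 1
```

Contrast that with a multiple-hash field, whose check would be a pure
function of the object's own bytes and needs no such lookups.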
     And what happens then? If they're lucky, the invalid objects
     haven't spread far, and the worst that happens is that they have
     to learn to use filter-branch (which is itself punishment
     enough). But sometimes a significant bit of history has been
     built on top, and it's awkward or impossible to rewrite it.

     That puts the burden on whoever is running the non-buggy
     implementation that wants to reject the objects. Do they accept
     these broken objects? If so, what do they do to mitigate the
     wrong answers that Git will return?

I'm much more in favor of keeping that data outside the object-hash
computation, and caching the pre-computed results as necessary. Those
caches can disagree with the objects, of course, but the cost of
dropping and rebuilding them is much lower than that of a history
rewrite.

I'm speaking primarily to the generation-number thing, where I really
don't think there's any benefit to embedding it in the object beyond
the obvious "well, it has to go _somewhere_, and this saves us
implementing a local cache layer". I haven't thought hard enough about
the multiple-hash thing to know whether there's some other benefit to
having it inside the objects.

-Peff