On Thu, 19 Jul 2007, Junio C Hamano wrote:
>
> ...
>
> But the real problem of this approach of course is that this is
> not reliable and can get a false match. You can find your
> beginning NUL in the SHA-1 part of one entry, and terminating
> NUL later in the SHA-1 part of next entry, and you will never
> notice.

[ I didn't react to this in your first email, because I thought you
  were talking about your "use the rules for the ASCII part" approach,
  and thought you meant that *that* was not reliable and could get a
  false match. But it seems that you were actually talking about the
  NUL character test. ]

Nope, wrong.

Why? Because there must always be a NUL *between* different SHA1's.
There's *always* a NUL character that precedes a SHA1.

So when you have two NUL characters (with no other NUL's between
them), you *know* that they cannot be from two different SHA1's. If
the first one was from an earlier SHA1, then the second one is
*guaranteed* to be the one that comes *before* the next SHA1.

See? You really have two, and only two, cases:

 - NUL's that are within 20 bytes of each other: you don't know
   anything about them. It might be that they are both within the
   *same* SHA1, or the first one was the one that separated the ASCII
   part from the SHA1, or the first one was a NUL in the previous SHA1
   and the second one was the NUL after the ASCII part.

   So two NUL's in the same 21-byte region (ie less than 20 bytes in
   *between* them) are not reliable. They tell you nothing, and you
   must just ignore them.

 - NUL's that are more than 20 bytes apart: the second NUL is
   *guaranteed* to be the NUL that precedes the next SHA1. They cannot
   be part of the same "NUL + sha1", and thus the first NUL *must* be
   from a previous SHA1 (or the NUL that preceded it). And that means
   that the second NUL *must* be the NUL that precedes the next SHA1.

So there is *no* ambiguity what-so-ever. It's not about guessing, and
it's not about "luck".
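[ Editorial note: the two-case argument above can be sketched in code.
  This is a hypothetical helper, not anything from git itself; it only
  assumes the tree-entry layout being discussed: "<mode> <name>\0"
  followed by 20 raw SHA-1 bytes, with no NULs in the ASCII part. ]

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not git's actual code): given a position in the
 * middle of a tree object's buffer, return the offset of the next
 * entry boundary, or -1 if we run off the end of the buffer (in which
 * case the caller must fall back on the linear search).
 *
 * Every SHA-1 is preceded by exactly one NUL (the name terminator),
 * and modes/names never contain NUL.  So if two *consecutive* NULs are
 * more than 20 bytes apart, the second one cannot lie inside a SHA-1:
 * it must be a name terminator, and the following entry starts 21
 * bytes after it (the NUL itself plus 20 SHA-1 bytes).
 */
static long next_entry_offset(const char *buf, size_t pos, size_t len)
{
	const char *prev = memchr(buf + pos, '\0', len - pos);

	while (prev) {
		const char *next = memchr(prev + 1, '\0',
					  buf + len - (prev + 1));
		if (!next)
			return -1;	/* ran out: do the linear search */
		if (next - prev > 20)
			/* 'next' terminates a name: skip NUL + SHA-1 */
			return (next - buf) + 21;
		/* ambiguous pair: ignore and keep scanning */
		prev = next;
	}
	return -1;
}
```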
If you don't find two NUL bytes separated by more than 20 bytes, you
start the linear search.

> In other words, if you are really really *really* unlucky, not
> only you might end up being fooled by random byte sequences in
> SHA-1 part of the tree object, you would not even notice that
> you have to fall back on the linear search.

Wrong. Either you find a guaranteed right place, or you run out of the
buffer and know you have to fall back on the linear search. Nobody
gets fooled.

> I've long time ago concluded that if we care about reliability
> (and we do very much), a bisectable tree without breaking
> backward compatibility is impossible.

No. You concluded incorrectly. I'm pretty damn sure the current tree
format is perfectly fine. It's dense, it's nice and linear, and it's
easily bisectable.

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
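[ Editorial note: the linear-search fallback mentioned above is also
  easy to sketch. Again this is hypothetical illustration code, not
  git's implementation; it assumes the same entry layout, "<mode>
  <name>\0" plus 20 raw SHA-1 bytes, and walks entry by entry from the
  start of the buffer. ]

```c
#include <stddef.h>
#include <string.h>

/*
 * Linear fallback sketch: walk entries from the beginning of the tree
 * buffer until we reach the first entry boundary at or after 'pos'.
 * Each step skips the ASCII "<mode> <name>" up to its NUL terminator,
 * then the 20 raw SHA-1 bytes that follow it.
 */
static size_t linear_entry_at_or_after(const char *buf, size_t len,
				       size_t pos)
{
	size_t cur = 0;

	while (cur < pos && cur < len) {
		const char *nul = memchr(buf + cur, '\0', len - cur);
		if (!nul)
			break;	/* malformed tree: no terminator left */
		cur = (size_t)(nul - buf) + 21;	/* NUL + 20-byte SHA-1 */
	}
	return cur;	/* first entry boundary >= pos */
}
```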