On Thu, 19 Jul 2007, Johannes Schindelin wrote:
>
> There is one severe shortcoming, though. Since tree objects can
> contain file names of a variable length, it is not possible to
> do a binary search for the correct base name in the tree object's
> contents.

Well, I've been thinking about this, and that's not really entirely
correct.

It *is* possible to do a binary search, it's just a bit complicated,
because you have to take the "halfway" thing, and then find the beginning
of an entry.

But the good news is that the tree entries have a very fixed format, and
one that is actually amenable to finding where they start.

It gets a bit complicated, but:

 - SHA1's contain random bytes, so we cannot really depend on their
   content. Fair enough. But on the other hand, they are fixed length,
   which means..

 - Each SHA1 is always preceded by a zero byte (it is what separates the
   filename from the SHA1), and while filenames too can have arbitrary
   content (and arbitrary length), we know that the *filename* doesn't
   have a zero byte in it.

 - So finding the beginning of a tree entry should be as easy as finding
   two zero bytes that have at least 20 bytes in between them: then you
   *know* that the second zero byte is the one that starts a new SHA1
   (it cannot be _inside_ a SHA1: if it was, there would be less than
   twenty bytes to the previous '\0', and it cannot be inside the
   filename either).

 - And you know that 20 bytes after that '\0' is the next tree entry!

Now, what does this mean? It means that if we actually know the filename
we're looking for, and we're looking at a large range, we really *could*
start out with binary searching. We would do something like this:

	unsigned char *start;
	unsigned long size;
	unsigned long thisentrylen;

	while (size > 200) {
		/*
		 * Look halfway, and then back up a bit, because we
		 * expect it to take us about 20 characters to find
		 * the zero we look for, and an additional 20
		 * characters is the subsequent SHA1.
		 */
		unsigned long guess = size / 2 - 40;

		/*
		 * This is the offset past which a zero means that
		 * we're good. If we don't find a zero in the first
		 * twenty bytes, that means that the first zero we
		 * find must be the one that starts a new SHA1!
		 */
		unsigned long goal_zero = guess + 20;

		for (;;) {
			unsigned char c;

			/*
			 * We need at least 22 characters more: the
			 * '\0' and the SHA1, and then the next entry.
			 * We know the ASCII mode is 4 characters, so
			 * we might as well make the rule "within 26 of
			 * the end".
			 */
			if (guess >= size - 26)
				goto fall_back_to_linear_search;
			c = start[guess++];
			if (c)
				continue;
			/* Found it? */
			if (guess > goal_zero)
				break;
			/*
			 * We found a zero that wasn't 20 bytes away,
			 * which means we have to reset our goal..
			 */
			goal_zero = guess + 20;
		}

		/*
		 * "guess" now points to one past the '\0': the SHA1 of
		 * the previous entry. Add 20, and it points at the start
		 * of a valid tree entry.
		 */
		guess = guess + 20;

		/* Length of the entry: ascii string + '\0' + SHA1 */
		thisentrylen = strlen((const char *)start + guess) + 1 + 20;

		/*
		 * .. compare the entry we found with
		 * .. the entry we are looking for!
		 */
		if (found > lookedfor) {
			/* what we want sorts before this entry */
			size = guess;
			continue;
		} else if (found == lookedfor) {
			/* Yay! FOUND! */
		} else {
			/* what we want sorts after this entry */
			guess += thisentrylen;
			size -= guess;
			start += guess;
			continue;
		}
	}

fall_back_to_linear_search:
	/* .. linear search in [ start, size ] .. */

Anyway, as you can tell, the above is totally untested, but I really
think it should work. Whether it really helps, I dunno. But if somebody
is interested in trying, it might be cool to see.
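For reference, this is roughly what walking a single entry looks like,
just to show the fixed layout that the zero-byte trick depends on (the
struct and function names here are made up purely for illustration):

	#include <stdlib.h>
	#include <string.h>

	struct tree_entry_info {
		unsigned int mode;		/* file mode, stored as ASCII octal */
		const char *name;		/* NUL-terminated filename */
		const unsigned char *sha1;	/* 20 raw bytes, *not* NUL-terminated */
	};

	/*
	 * Parse the entry at "p" and return a pointer to the next entry.
	 * Each entry is "<octal mode> <name>\0<20-byte SHA1>".
	 */
	static const unsigned char *parse_one_entry(const unsigned char *p,
			struct tree_entry_info *info)
	{
		char *end;

		info->mode = strtoul((const char *)p, &end, 8);
		info->name = end + 1;	/* skip the space after the mode */
		info->sha1 = (const unsigned char *)info->name
				+ strlen(info->name) + 1;
		return info->sha1 + 20;	/* next entry starts right after the SHA1 */
	}

The name is the only variable-length part, and it is exactly the part
bracketed by the space and the '\0', which is why the "two zeros at
least 20 bytes apart" rule above is enough to re-synchronize.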
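And for completeness, the linear fallback that the sketch above punts to
could look something like this (equally untested; "lookedfor" is the same
hypothetical filename we're searching for, and this just does a plain
strcmp on the names):

	/*
	 * Linear fallback: walk the entries one by one and compare the
	 * filename of each against what we are looking for. Returns a
	 * pointer to the start of the matching entry, or NULL.
	 */
	static const unsigned char *linear_search(const unsigned char *start,
			unsigned long size, const char *lookedfor)
	{
		const unsigned char *end = start + size;

		while (start < end) {
			/* every entry has a space between mode and name */
			const char *name = strchr((const char *)start, ' ') + 1;
			/* ascii "mode name" + '\0' + SHA1 */
			unsigned long entrylen =
				strlen((const char *)start) + 1 + 20;

			if (!strcmp(name, lookedfor))
				return start;
			start += entrylen;
		}
		return NULL;
	}

The comparison for the binary-search case would also have to honor the
tree sort order (directories sort as if they had a trailing '/'), but a
straight name compare is enough to show the shape of the fallback.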
And yes, the "search for zero bytes" is not *guaranteed* to find any beginning at all, if you have lots of short names, *and* lots of zero bytes in the SHA1's. But while short names may be common, zero bytes in SHA1's are not so much (since you should expect to see a very even distribution of bytes, and as such most SHA1's by far should have no zero bytes at all!) So if you're really really *really* unlucky, you might end up having to fall back on the linear search. But it still works! Can anybody see anything wrong in my thinking above? (And the real question is whether it really helps. I suspect it does actually help for big directories, and that it is worth doing, but maybe the magic number in "while (size > 200)" could be tweaked. The logic of that was that binary searching doesn't work very well for just a few entries (and "size < 200" implies ~5-6 directory entries), but also that linear search is actually perfectly good when it's just a couple of cache-lines, and binary searching - especially with the complication of having to find the beginning - isn't worth it unless it really means that we can avoid a cache miss. Of course, it may well be that the *real* cost of the directories is just the uncompression thing, and that the search is not the problem. Who knows? Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html