On Thu, 19 Jul 2007, Johannes Schindelin wrote:
>
> There is one severe shortcoming, though. Since tree objects can
> contain file names of a variable length, it is not possible to
> do a binary search for the correct base name in the tree object's
> contents.

Well, I've been thinking about this, and that's not really entirely
correct.

It *is* possible to do a binary search, it's just a bit complicated,
because you have to take the "halfway" thing, and then find the beginning
of an entry.

But the good news is that the tree entries have a very fixed format, and
one that is actually amenable to finding where they start.

It gets a bit complicated, but:

 - SHA1's contain random bytes, so we cannot really depend on their
   content. Fair enough. But on the other hand, they are fixed length,
   which means..

 - Each SHA1 is always preceded by a zero byte (it is what separates the
   filename from the SHA1), and while filenames too can have arbitrary
   content (and arbitrary length), we know that the *filename* doesn't
   have a zero byte in it.

 - So finding the beginning of a tree entry should be as easy as finding
   two zero bytes that have at least 20 bytes in between them: then you
   *know* that the second zero byte is the one that starts a new SHA1
   (it cannot be _inside_ a SHA1: if it was, there would be less than
   twenty bytes to the previous '\0', and it cannot be inside the
   filename either).

 - And you know that 20 bytes after that '\0' is the next tree entry!

Now, what does this mean? It means that if we actually know the filename
we're looking for, and we're looking at a large range, we really *could*
start out with binary searching. We would do something like this:

	unsigned char *start;
	unsigned long size;
	unsigned long thisentrylen;

	while (size > 200) {
		/*
		 * Look halfway, and then back up a bit, because we
		 * expect it to take us about 20 characters to find
		 * the zero we look for, and an additional 20
		 * characters is the subsequent SHA1.
		 */
		unsigned long guess = size / 2 - 40;

		/*
		 * This is the offset past which a zero means that
		 * we're good. If we don't find a zero in the first
		 * twenty bytes, that means that the first zero we
		 * find must be the one that starts a new SHA1!
		 */
		unsigned long goal_zero = guess + 20;

		for (;;) {
			unsigned char c;

			/*
			 * We need at least 22 characters more: the
			 * '\0' and the SHA1, and then the next entry.
			 * We know the ASCII mode is 4 characters, so
			 * we might as well make the rule "within 26 of
			 * the end".
			 */
			if (guess >= size - 26)
				goto fall_back_to_linear_search;
			c = start[guess++];
			if (c)
				continue;
			/* Found it? */
			if (guess > goal_zero)
				break;
			/*
			 * We found a zero that wasn't 20 bytes away,
			 * which means we have to reset our goal..
			 */
			goal_zero = guess + 20;
		}

		/*
		 * "guess" now points to one past the '\0': the SHA1 of
		 * the previous entry. Add 20, and it points at the start
		 * of a valid tree entry.
		 */
		guess = guess + 20;

		/* Length of the entry: ascii string + '\0' + SHA1 */
		thisentrylen = strlen((const char *)start + guess) + 1 + 20;

		/*
		 * .. compare the entry we found with
		 * .. the entry we are looking for!
		 */
		if (found > lookedfor) {
			/* what we want sorts before this entry */
			size = guess;
			continue;
		} else if (found == lookedfor) {
			/* Yay! FOUND! */
		} else {
			/* what we want sorts after this entry */
			guess += thisentrylen;
			size -= guess;
			start += guess;
			continue;
		}
	}

fall_back_to_linear_search:
	/* .. linear search in [ start, size ] .. */

Anyway, as you can tell, the above is totally untested, but I really
think it should work. Whether it really helps, I dunno. But if somebody
is interested in trying, it might be cool to see.
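For reference, this is roughly what walking a single entry looks like,
just to show the fixed layout that the zero-byte trick depends on (the
struct and function names here are made up purely for illustration):

	#include <stdlib.h>
	#include <string.h>

	struct tree_entry_info {
		unsigned int mode;		/* file mode, stored as ASCII octal */
		const char *name;		/* NUL-terminated filename */
		const unsigned char *sha1;	/* 20 raw bytes, *not* NUL-terminated */
	};

	/*
	 * Parse the entry at "p" and return a pointer to the next entry.
	 * Each entry is "<octal mode> <name>\0<20-byte SHA1>".
	 */
	static const unsigned char *parse_one_entry(const unsigned char *p,
			struct tree_entry_info *info)
	{
		char *end;

		info->mode = strtoul((const char *)p, &end, 8);
		info->name = end + 1;	/* skip the space after the mode */
		info->sha1 = (const unsigned char *)info->name
				+ strlen(info->name) + 1;
		return info->sha1 + 20;	/* next entry starts right after the SHA1 */
	}

The name is the only variable-length part, and it is exactly the part
bracketed by the space and the '\0', which is why the "two zeros at
least 20 bytes apart" rule above is enough to re-synchronize.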
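And for completeness, the linear fallback that the sketch above punts to
could look something like this (equally untested; "lookedfor" is the same
hypothetical filename we're searching for, and this just does a plain
strcmp on the names):

	/*
	 * Linear fallback: walk the entries one by one and compare the
	 * filename of each against what we are looking for. Returns a
	 * pointer to the start of the matching entry, or NULL.
	 */
	static const unsigned char *linear_search(const unsigned char *start,
			unsigned long size, const char *lookedfor)
	{
		const unsigned char *end = start + size;

		while (start < end) {
			/* every entry has a space between mode and name */
			const char *name = strchr((const char *)start, ' ') + 1;
			/* ascii "mode name" + '\0' + SHA1 */
			unsigned long entrylen =
				strlen((const char *)start) + 1 + 20;

			if (!strcmp(name, lookedfor))
				return start;
			start += entrylen;
		}
		return NULL;
	}

The comparison for the binary-search case would also have to honor the
tree sort order (directories sort as if they had a trailing '/'), but a
straight name compare is enough to show the shape of the fallback.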
And yes, the "search for zero bytes" is not *guaranteed* to find any beginning at all, if you have lots of short names, *and* lots of zero bytes in the SHA1's. But while short names may be common, zero bytes in SHA1's are not so much (since you should expect to see a very even distribution of bytes, and as such most SHA1's by far should have no zero bytes at all!) So if you're really really *really* unlucky, you might end up having to fall back on the linear search. But it still works! Can anybody see anything wrong in my thinking above? (And the real question is whether it really helps. I suspect it does actually help for big directories, and that it is worth doing, but maybe the magic number in "while (size > 200)" could be tweaked. The logic of that was that binary searching doesn't work very well for just a few entries (and "size < 200" implies ~5-6 directory entries), but also that linear search is actually perfectly good when it's just a couple of cache-lines, and binary searching - especially with the complication of having to find the beginning - isn't worth it unless it really means that we can avoid a cache miss. Of course, it may well be that the *real* cost of the directories is just the uncompression thing, and that the search is not the problem. Who knows? Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html