Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes: > And yes, the "search for zero bytes" is not *guaranteed* to find any > beginning at all, if you have lots of short names, *and* lots of zero > bytes in the SHA1's. But while short names may be common, zero bytes in > SHA1's are not so much (since you should expect to see a very even > distribution of bytes, and as such most SHA1's by far should have no zero > bytes at all!) > > So if you're really really *really* unlucky, you might end up having to > fall back on the linear search. But it still works! > > Can anybody see anything wrong in my thinking above? Another anchoring clue you seem not to be exploiting fully is that the ASCII part must match "^[1-7][0-7]{4,5} " (mode bytes). But the real problem of this approach of course is that this is not reliable and can get a false match. You can find your beginning NUL in the SHA-1 part of one entry, and terminating NUL later in the SHA-1 part of next entry, and you will never notice. However, in the case of Dscho's "notes" code, I do not think (1) you do not have to guess like the above, and (2) the problem is much simpler. Dcsho's "note" looks like a tree full of two-byte [0-9a-f]{2} names, each of them points at another tree, with the second level tree being full of 32-byte [0-9a-f]{38} names, each of them points at a blob. So it is a much more regular, strict shape. And in order to look for a note for an object whose name is ([0-9a-f]{2})([0-9a-f]{38}), you will find the blob that is at "$1/$2" in a "note". I was suggesting to have a specialized parser only to read such tree objects that are "abused" to represent notes. You can cheaply validate that these trees are of expected shape. (1) Validate that size of the toplevel tree is multiple of 29 = (5 + 1 + 2 + 1 + 20); the second level should be multiple of 66 = (6 + 1 + 38 + 1 + 20). These two levels of trees are of fixed-entry-length that allows easy binary search. (2) While binary searching trees of either level, you can validate that the entry looks like from a note (for the toplevel, "40000 [0-9a-f]{2}\0", for the second level, "100644 [0-9a-f]{38}\0"). For an added safety, a "notes" writer could even throw in signature bytes (say, a symlink whose name is " !" in the top-level tree, and another symlink " !{37}" in the second-level tree) to protect the reader. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html