On Sat, 31 Mar 2007, Geert Bosch wrote: > > I've been working on a design for a new index file format, and I'll > include my current draft at the end of this message. It is not finished, > and I've not written much code yet, but I'd like to prevent duplicated > work and give others the opportunity to borrow from the ideas so far. Ok, here's one comment... I think your basic assumption that the current index-pack is bad is broken ;) The thing is, I think the index-pack could be improved, but I think we can get a bigger improvement from just being more intelligent about searching than from actually needing to re-organize the pack. Here's a few clues: - binary search is certainly not the only way to do lookups on sorted contents - one of the fundamental rules about cryptographic hashes is that they are *evenly*distributed* - currently we do not take advantage of our inherent knowledge of the distribution when we look things up, and we use a stupid binary lookup that is guaranteed to basically always take O(log2(n)) random jumps around the data area, with bad locality. In other words, the 256-entry fan-out was meant to avoid the first eight levels of binary lookup. But the thing is, we should be able to generally do a *lot* better than any binary lookup by just doing a LINEAR SEARCH, which is also more cache-friendly, and prefetches much better when it's not in the cache. The only thing we want for a linear search is a good starting point. Which gets us back to the simple: "SHA1 hashes are evenly distributed". In other words, getting a good starting point should be *trivial*. So here's a suggestion: - start finding a range using the 256-entry fan-out, exactly the way we did for the binary search. It's cheap. We could probably avoid EVEN THIS, and just do one more newton-raphson iteration more. But since we have the data, we migth as well use it. After all, it really *is* just a first approximation of newton-raphson, and while it uses only 8 bits (and we could do better), at least it's an *exact* one. - use newton-raphson to iterate closer. It should be a much faster way to find the rough area for the entry we're searching for than binary search. Two or three iterations should get us there, easily. - do a linear search once you're close enough. Here's a stupid patch that does that. It's biggest weakness is that it only does a single iteration of newton-raphson. It should probably do at least one more iteration. I bet that you'd get so close that you find it really quickly if you did that. As it is, it doesn't seem to be any slower than the binary search.. And the nice thing is that I think that in large packs the "evenly distributed" should be *more*true* than in small packs, so this should scale up very nicely, and I'm hoping that it would get much better cache behaviour (because we wouldn't bounce around in the index file too much). I wouldn't be surprised if we basically hit it on the first try if we did two or three iterations of newton-raphson. Comments? Does anybody want to take this and run with it? (No guarantees that this really can ever beat binary search, but somebody might find this interesting. And I think the potential is there, but I think you do need to do at *least* two iterations). Linus --- sha1_file.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++----------- 1 files changed, 64 insertions(+), 15 deletions(-) diff --git a/sha1_file.c b/sha1_file.c index 9c26038..fc4dbc7 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1543,29 +1543,78 @@ int nth_packed_object_sha1(const struct packed_git *p, uint32_t n, return 0; } -off_t find_pack_entry_one(const unsigned char *sha1, - struct packed_git *p) +struct index_entry { + uint32_t nb_offset; /* pack-file offset in network byte order */ + unsigned char sha1[20]; /* SHA1 of the object */ +}; + +static off_t pack_search_backward(const struct index_entry *index, uint32_t guess, const unsigned char *sha1) { - const uint32_t *level1_ofs = p->index_data; - int hi = ntohl(level1_ofs[*sha1]); - int lo = ((*sha1 == 0x0) ? 0 : ntohl(level1_ofs[*sha1 - 1])); - const unsigned char *index = p->index_data; + int count = 1; + while (guess) { + int cmp; - index += 4 * 256; + guess--; + index--; + cmp = hashcmp(index->sha1, sha1); + if (!cmp) + return ntohl(index->nb_offset); + if (cmp < 0) + break; + count++; + } + return 0; +} - do { - int mi = (lo + hi) / 2; - int cmp = hashcmp(index + 24 * mi + 4, sha1); +static off_t pack_search_forward(const struct index_entry *index, uint32_t guess, uint32_t max, const unsigned char *sha1) +{ + int count = 1; + while (++guess < max) { + int cmp; + + index++; + cmp = hashcmp(index->sha1, sha1); if (!cmp) - return ntohl(*((uint32_t *)((char *)index + (24 * mi)))); + return ntohl(index->nb_offset); if (cmp > 0) - hi = mi; - else - lo = mi+1; - } while (lo < hi); + break; + count++; + } return 0; } +off_t find_pack_entry_one(const unsigned char *sha1, + struct packed_git *p) +{ + const struct index_entry *index; + const uint32_t *level1_ofs = p->index_data; + uint32_t min, max, first; + uint32_t hash_val; + uint32_t guess; + int cmp; + + first = *sha1; + min = first ? ntohl(level1_ofs[first - 1]) : 0; + max = ntohl(level1_ofs[first]); + + hash_val = (sha1[1] << 24) | + (sha1[2] << 16) | + (sha1[3] << 8) | + (sha1[4]); + guess = ((uint64_t)(max - min) * hash_val) >> 32; + guess += min; + + index = (const struct index_entry *)((char *) p->index_data + 4 * 256); + index += guess; + + cmp = hashcmp(index->sha1, sha1); + if (!cmp) + return ntohl(index->nb_offset); + if (cmp > 0) + return pack_search_backward(index, guess, sha1); + return pack_search_forward(index, guess, max, sha1); +} + static int matches_pack_name(struct packed_git *p, const char *ig) { const char *last_c, *c; - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html