Nicolas Pitre <nico@xxxxxxx> wrote: > On Sat, 26 May 2007, Shawn O. Pearce wrote: > > > Dana How <danahow@xxxxxxxxx> wrote: > > > Shawn: When I first saw the index-loading code, my first > > > thought was that all the index tables should be > > > merged (easy since sorted) so callers only need to do one search. > > There is also the question of memory footprint. If you have a global > index, then for each object you need to have a tupple containing SHA1 + > pack offset + reference to corresponding pack. Right now we only need > SHA1 + pack offset. I'm about half-way through a super-index implementation. Right now the super-index is defined to be just one index file per repository (objects/pack/super.index) with a format that looks like the following: header: uint32_t sdx_signature ('PSDX') uint32_t sdx_version (1) uint16_t sdx_packs uint8_t sdx_prefix_len; uint8_t __reserved; pack table: /* SHA-1 of each pack's sorted SHA-1 object list */ unsigned char pack_sha1[20][sdx_packs]; fan-out table: /* This is the standard fan-out also used in a tOc/.idx */ uint32_t fan_out[256]; records: unsigned char prefix[hdr.sdx_prefix_len - 1]; int8_t splits[hdr.sdx_packs]; trailer: unsigned char sha1_of_the_above[20]; I build the super-index by merging the .idx of all available packfiles; since they are already sorted the merge is obviously quite trivial. The sdx_prefix_len field is initialized to something that gives a reasonably unique object name; e.g. in git.git an sdx_prefix_len of 3 or 4 gets pretty good at narrowing the set of objects down very small. The idea here is that the sdx_prefix_len should be almost the unique abbreviation length for this repository. We store 1 less byte of the prefix in the record because of the fan-out table already accounting for the first byte of the prefix. The splits array contains a single signed integer for each packfile; if the integer is 0 then that packfile does not contain any object that starts with that corresponding prefix. In such a case we can completely avoid looking at that corresponding packfile. With my lazy index loading change, I may not even need to open that index. ;-) If the splits array entry is non-zero and is negative, its the number of times we need to halve down (towards 'lo') to get to entries that start with this prefix and that are in that packfile's fan-out table entry range. If its positive its the number of times we have to halve up (towards 'hi'). This way we can jump more directly to the relevant index records and avoid redoing binary search work we have already accomplished in the super index. So we can generally build super index records at a cost of 3 or 4 bytes + sdx_packs. We can also determine which packfile(s) we need to scan quite quickly, and we can jump further into the part of the index avoiding a number of expensive hashcmp() calls. It may actually be a good savings at runtime, well worth the slightly higher memory footprint. My testing is not yet complete, so I cannot offer any hard numbers against any interesting/common data sets... > BTW I think the Newton-Raphson based index lookup approach should be > revived at some point. That doesn't help with 10 packfiles though, does it? -- Shawn. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html