Re: [PATCH 1/3] Lazily open pack index files on demand

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Sun, 27 May 2007 17:52:45 -0400

Nicolas Pitre <nico@xxxxxxx> wrote:
> On Sat, 26 May 2007, Shawn O. Pearce wrote:
> 
> > Dana How <danahow@xxxxxxxxx> wrote:
> > > Shawn:  When I first saw the index-loading code,  my first
> > > thought was that all the index tables should be
> > > merged (easy since sorted) so callers only need to do one search.
> 
> There is also the question of memory footprint.  If you have a global 
> index, then for each object you need to have a tupple containing SHA1 + 
> pack offset + reference to corresponding pack.  Right now we only need 
> SHA1 + pack offset.

I'm about half-way through a super-index implementation.  Right now
the super-index is defined to be just one index file per repository
(objects/pack/super.index) with a format that looks like the
following:

  header:
    uint32_t sdx_signature ('PSDX')
    uint32_t sdx_version (1)
    uint16_t sdx_packs
    uint8_t  sdx_prefix_len;
    uint8_t  __reserved;

  pack table:
    /* SHA-1 of each pack's sorted SHA-1 object list */
    unsigned char pack_sha1[20][sdx_packs];

  fan-out table:
    /* This is the standard fan-out also used in a tOc/.idx */
    uint32_t fan_out[256];

  records:
    unsigned char prefix[hdr.sdx_prefix_len - 1];
    int8_t        splits[hdr.sdx_packs];

  trailer:
    unsigned char sha1_of_the_above[20];

I build the super-index by merging the .idx of all available
packfiles; since they are already sorted the merge is obviously
quite trivial.

The sdx_prefix_len field is initialized to something that gives a
reasonably unique object name; e.g. in git.git an sdx_prefix_len
of 3 or 4 gets pretty good at narrowing the set of objects down
very small.  The idea here is that the sdx_prefix_len should be
almost the unique abbreviation length for this repository.

We store 1 less byte of the prefix in the record because of the
fan-out table already accounting for the first byte of the prefix.

The splits array contains a single signed integer for each packfile;
if the integer is 0 then that packfile does not contain any object
that starts with that corresponding prefix.  In such a case we
can completely avoid looking at that corresponding packfile.
With my lazy index loading change, I may not even need to open
that index.  ;-)

If the splits array entry is non-zero and is negative, its the number
of times we need to halve down (towards 'lo') to get to entries
that start with this prefix and that are in that packfile's fan-out
table entry range.  If its positive its the number of times we have
to halve up (towards 'hi').  This way we can jump more directly to
the relevant index records and avoid redoing binary search work we
have already accomplished in the super index.

So we can generally build super index records at a cost of 3 or
4 bytes + sdx_packs.  We can also determine which packfile(s) we
need to scan quite quickly, and we can jump further into the part
of the index avoiding a number of expensive hashcmp() calls.  It
may actually be a good savings at runtime, well worth the slightly
higher memory footprint.

My testing is not yet complete, so I cannot offer any hard numbers
against any interesting/common data sets...

> BTW I think the Newton-Raphson based index lookup approach should be 
> revived at some point.

That doesn't help with 10 packfiles though, does it?

-- 
Shawn.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html