Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

Jeff King <peff@xxxxxxxx> · Mon, 8 Jul 2013 03:57:12 -0400

On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote:

> On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <peff@xxxxxxxx> wrote:
> > The pack revindex stores the offsets of the objects in the
> > pack in sorted order, allowing us to easily find the on-disk
> > size of each object. To compute it, we populate an array
> > with the offsets from the sha1-sorted idx file, and then use
> > qsort to order it by offsets.
> >
> > That does O(n log n) offset comparisons, and profiling shows
> > that we spend most of our time in cmp_offset. However, since
> > we are sorting on a simple off_t, we can use numeric sorts
> > that perform better. A radix sort can run in O(k*n), where k
> > is the number of "digits" in our number. For a 64-bit off_t,
> > using 16-bit "digits" gives us k=4.
> 
> Did you try the simple bucket sort Colby now uses in JGit?
> 
> The sort is pretty simple:
> 
>   bucket_size = pack_length / object_count;
>   buckets[] = malloc(object_count * sizeof(int));
> 
>   foreach obj in idx:
>     push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);
> 
>   foreach bucket:
>     insertion sort by offset

I did do something similar (though I flattened my buckets into a single
list afterwards), but I ended up closer to 700ms (down from 830ms, but
with the radix sort around 200ms). It's entirely possible I screwed up
something in the implementation (the bucket insertion can be done in a
lot of different ways, many of which are terrible), but I didn't keep a
copy of that attempt. If you try it and have better numbers, I'd be
happy to see them.

> We observed on linux.git that most buckets have an average number of
> objects. IIRC the bucket_size was ~201 bytes and most buckets had very
> few objects each. For lookups we keep the bucket_size parameter and a
> bucket index table. This arrangement uses 8 bytes per object in the
> reverse index, making it very memory efficient. Searches are typically
> below O(log N) time because each bucket has <log N entries.

I didn't measure lookups at all; I was focused on time to build the
index. So if there were benefits there that make up for a longer setup
time, I wouldn't have measured them (of course, we also care about the
case with few lookups, so it would be a tradeoff). You could also leave
each bucket unsorted and only lazily sort it when a lookup hits the
bucket, which might help that case (I didn't look to see if you do that
in JGit).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html