Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

Shawn Pearce <spearce@xxxxxxxxxxx> · Sun, 7 Jul 2013 16:52:23 -0700

On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <peff@xxxxxxxx> wrote:
> The pack revindex stores the offsets of the objects in the
> pack in sorted order, allowing us to easily find the on-disk
> size of each object. To compute it, we populate an array
> with the offsets from the sha1-sorted idx file, and then use
> qsort to order it by offsets.
>
> That does O(n log n) offset comparisons, and profiling shows
> that we spend most of our time in cmp_offset. However, since
> we are sorting on a simple off_t, we can use numeric sorts
> that perform better. A radix sort can run in O(k*n), where k
> is the number of "digits" in our number. For a 64-bit off_t,
> using 16-bit "digits" gives us k=4.

Did you try the simple bucket sort Colby now uses in JGit?

The sort is pretty simple:

  bucket_size = pack_length / object_count;
  buckets[] = malloc(object_count * sizeof(int));

  foreach obj in idx:
    push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);

  foreach bucket:
    insertion sort by offset

https://eclipse.googlesource.com/jgit/jgit/+/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/PackReverseIndex.java

We observed on linux.git that most buckets have an average number of
objects. IIRC the bucket_size was ~201 bytes and most buckets had very
few objects each. For lookups we keep the bucket_size parameter and a
bucket index table. This arrangement uses 8 bytes per object in the
reverse index, making it very memory efficient. Searches are typically
below O(log N) time because each bucket has <log N entries.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html