On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote: > On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <peff@xxxxxxxx> wrote: > > The pack revindex stores the offsets of the objects in the > > pack in sorted order, allowing us to easily find the on-disk > > size of each object. To compute it, we populate an array > > with the offsets from the sha1-sorted idx file, and then use > > qsort to order it by offsets. > > > > That does O(n log n) offset comparisons, and profiling shows > > that we spend most of our time in cmp_offset. However, since > > we are sorting on a simple off_t, we can use numeric sorts > > that perform better. A radix sort can run in O(k*n), where k > > is the number of "digits" in our number. For a 64-bit off_t, > > using 16-bit "digits" gives us k=4. > > Did you try the simple bucket sort Colby now uses in JGit? > > The sort is pretty simple: > > bucket_size = pack_length / object_count; > buckets[] = malloc(object_count * sizeof(int)); > > foreach obj in idx: > push_chain(buckets[obj.offset / bucket_size], obj.idx_nth); > > foreach bucket: > insertion sort by offset I did do something similar (though I flattened my buckets into a single list afterwards), but I ended up closer to 700ms (down from 830ms, but with the radix sort around 200ms). It's entirely possible I screwed up something in the implementation (the bucket insertion can be done in a lot of different ways, many of which are terrible), but I didn't keep a copy of that attempt. If you try it and have better numbers, I'd be happy to see them. > We observed on linux.git that most buckets have an average number of > objects. IIRC the bucket_size was ~201 bytes and most buckets had very > few objects each. For lookups we keep the bucket_size parameter and a > bucket index table. This arrangement uses 8 bytes per object in the > reverse index, making it very memory efficient. Searches are typically > below O(log N) time because each bucket has <log N entries. I didn't measure lookups at all; I was focused on time to build the index. So if there were benefits there that make up for a longer setup time, I wouldn't have measured them (of course, we also care about the case with few lookups, so it would be a tradeoff). You could also leave each bucket unsorted and only lazily sort it when a lookup hits the bucket, which might help that case (I didn't look to see if you do that in JGit). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html