On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <peff@xxxxxxxx> wrote: > The pack revindex stores the offsets of the objects in the > pack in sorted order, allowing us to easily find the on-disk > size of each object. To compute it, we populate an array > with the offsets from the sha1-sorted idx file, and then use > qsort to order it by offsets. > > That does O(n log n) offset comparisons, and profiling shows > that we spend most of our time in cmp_offset. However, since > we are sorting on a simple off_t, we can use numeric sorts > that perform better. A radix sort can run in O(k*n), where k > is the number of "digits" in our number. For a 64-bit off_t, > using 16-bit "digits" gives us k=4. Did you try the simple bucket sort Colby now uses in JGit? The sort is pretty simple: bucket_size = pack_length / object_count; buckets[] = malloc(object_count * sizeof(int)); foreach obj in idx: push_chain(buckets[obj.offset / bucket_size], obj.idx_nth); foreach bucket: insertion sort by offset https://eclipse.googlesource.com/jgit/jgit/+/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/PackReverseIndex.java We observed on linux.git that most buckets have an average number of objects. IIRC the bucket_size was ~201 bytes and most buckets had very few objects each. For lookups we keep the bucket_size parameter and a bucket index table. This arrangement uses 8 bytes per object in the reverse index, making it very memory efficient. Searches are typically below O(log N) time because each bucket has <log N entries. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html