Re: [PATCH/RFC] index-pack: produce pack index version 3

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 12 Aug 2012 12:49:56 -0700

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Nguyễn Thái Ngọc Duy  <pclouds@xxxxxxxxx> writes:
>
>> The main reason to group objects by type is to make it possible to
>> create another sha1->something mapping for a particular object type,
>> without wasting space for storing sha-1 keys again. For example, we
>> can store commit caches, tree caches... at the end of the index as
>> extensions.
>
> Why can't you do the same with a single "sorted by SHA-1" table?
>
> Not impressed yet.

The above should be "Not impressed yet, as it lacks sufficient
explanation of possible future benefits, but the idea is
interesting."

For example, the reachability bitmap would want to say something
like "Traversing from commit A, these objects in this pack are
reachable."  The bitmap for one commit A would logically consist of
N bits for a packfile that stores N objects (the resulting bitmap
needs to be compressed before going to disk, perhaps with RLE or
something).  With the single "sorted by SHA-1" table, we can use the
index in that single table to enumerate all reachable objects of any
type in one go.  With four separate tables, on the other hand, we
would need four bitmaps per commit.
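
To make the comparison concrete, here is a minimal sketch in Python (not git code). The object names, types, and helper functions are all made up for illustration, and the bitmaps are left uncompressed. It shows one bitmap indexed by position in a single sorted table, next to one bitmap per type for the same commit.

```python
# Hypothetical toy pack: shortened object names mapped to their types.
objects = {
    "1a": "commit", "3c": "tree", "5e": "blob",
    "7f": "commit", "9b": "tree", "c4": "blob",
}

# Single table sorted by object name; the position in it is the bit index.
table = sorted(objects)

def bitmap_for(reachable):
    """Return an int whose bit i is set when table[i] is reachable."""
    bits = 0
    for i, name in enumerate(table):
        if name in reachable:
            bits |= 1 << i
    return bits

def enumerate_bitmap(bits):
    """List every reachable object, of any type, in one pass."""
    return [table[i] for i in range(len(table)) if (bits >> i) & 1]

# Alternative layout: one sorted table, and so one bitmap, per object type.
TYPES = ("commit", "tree", "blob", "tag")
per_type = {t: sorted(n for n, ty in objects.items() if ty == t) for t in TYPES}

def bitmaps_per_type(reachable):
    """Return one bitmap per type, each indexed into that type's own table."""
    return {
        t: sum(1 << i for i, n in enumerate(names) if n in reachable)
        for t, names in per_type.items()
    }

# Assume that traversing from commit "7f" reaches these objects.
reachable_from_7f = {"7f", "9b", "c4"}
single = bitmap_for(reachable_from_7f)
split = bitmaps_per_type(reachable_from_7f)
```

Here `single` is one integer covering all six objects, while `split` holds four separate bitmaps, including an empty one for tags.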

Either way is _possible_, but I think the former is simpler.  The
latter also makes it harder to introduce new types of objects in the
future; I do not think we have examined the possible use cases well
enough to declare that "four types is enough forever".

Either way, we would have such a bitmap (or a set of four bitmaps
in your case) for more than one commit (it is neither necessary nor
desirable to add the reachability bitmap to all commits), and such a
"reachability extension" would need to store a sequence of "the
commit object name the bitmap (or a set of four bitmaps) is about,
and the bitmap (or set of four bitmaps)".  That object name does not
have to be 20-byte but would be a varint representation of the
offset into the "sorted by SHA-1" table.  That varint representation
would be smaller by about 3.5 bits if you have a separate "commit
only, sorted by SHA-1" table (as the number of all objects tends to
be 10x larger than the number of commits that need them).  For
the particular use case of "we want to only annotate the commits,
never other kinds of objects", it would be a win.  But without
knowing what other use cases we will want to use the "object
annotation in the pack index file" mechanism for, having four tables
to shave 3.5 bits per object feels like a premature optimization to
me.
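
The size estimate can be checked with a small Python sketch. It assumes a generic 7-bits-per-byte varint (not necessarily the encoding git would use) and made-up counts of 1,000,000 objects and 100,000 commits; `encode_varint` and `varint_len` are illustrative helpers, not existing functions.

```python
import math

def encode_varint(n):
    """Encode n with 7 payload bits per byte, low bits first (assumed format)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

def varint_len(n):
    """Number of bytes encode_varint() needs for n."""
    return len(encode_varint(n))

# Hypothetical pack: ten times as many objects as commits.
n_objects = 1_000_000
n_commits = 100_000

# An index into a table that is 10x smaller needs log2(10) fewer bits.
bits_saved = math.log2(n_objects / n_commits)

# Bytes needed for the largest index into each table.
bytes_all = varint_len(n_objects - 1)
bytes_commits = varint_len(n_commits - 1)
```

With these assumed counts `bits_saved` comes out a little over 3.3, in line with the rough figure above, and the largest index into either table still fits in three varint bytes, so the saving only becomes a whole byte when the two counts straddle a 7-bit boundary.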
