Re: [PATCH v2] Document pack v4 format

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 05 Sep 2013 13:26:10 -0700

Nicolas Pitre <nico@xxxxxxxxxxx> writes:

> On Thu, 5 Sep 2013, Duy Nguyen wrote:
>
>> On Thu, Sep 5, 2013 at 12:39 PM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
>> > Now the pack index v3 probably needs to be improved a little, again to
>> > accommodate completion of thin packs.  Given that the main SHA1 table is
>> > now in the main pack file, it should be possible to still carry a small
>> > SHA1 table in the index file that corresponds to the appended objects
>> > only. This means that a SHA1 search will have to first use the main SHA1
>> > table in the pack file as it is done now, and if not found then use the
>> > SHA1 table in the index file if it exists.  And of course
>> > nth_packed_object_sha1() will have to be adjusted accordingly.
>> 
>> What if the sender prepares the sha-1 table to contain missing objects
>> in advance? The sender should know what base objects are missing. Then
>> we only need to append objects at the receiving end and verify that
>> all new objects are also present in the sha-1 table.
>
> I do like this idea very much.  And that doesn't increase the thin pack 
> size as the larger SHA1 table will be compensated by a smaller sha1ref 
> encoding in those objects referring to the missing ones.

Let me see if I understand the proposal correctly.  Compared to a
normal pack-v4 stream, a thin pack-v4 stream:

 - has all the SHA-1 object names involved in the stream in its main
   object name table---most importantly, names of objects that
   "thin" optimization omits from the pack data body are included;

 - uses the SHA-1 object name table offset to refer to other
   objects, even to ones that the thin stream will not transfer in
   the pack data body;

 - is completed at the receiving end by appending the data for the
   objects that were not transferred due to the "thin" optimization.
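
As a rough illustration of the three points above, here is a minimal
Python sketch of the receiving side.  It is not git's code: the
function name complete_thin_pack, the dict-based "local_store" and
the use of plain strings for object names are all assumptions made
for the example.

```python
def complete_thin_pack(name_table, sent, local_store):
    """Complete a thin stream (hypothetical model, not git's implementation).

    name_table  -- sorted list of object names announced by the sender,
                   including the bases omitted by the "thin" optimization
    sent        -- dict mapping name -> data actually transferred
    local_store -- dict mapping name -> data the receiver already has
    """
    if name_table != sorted(set(name_table)):
        raise ValueError("name table must be sorted and free of duplicates")

    # Every transferred object must have been announced in the table.
    unexpected = set(sent) - set(name_table)
    if unexpected:
        raise ValueError("objects not in name table: %s" % sorted(unexpected))

    completed = dict(sent)
    for name in name_table:
        if name in completed:
            continue
        # An omitted base: append its data from what the receiver has.
        if name not in local_store:
            raise KeyError("missing base object: %s" % name)
        completed[name] = local_store[name]
    return completed
```

After completion, the set of objects in the result is exactly the set
of names in the table; nothing needs to be added to the table itself.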

So the invariant "all objects contained in the pack" in:

 - A table of sorted SHA-1 object names for all objects contained in
   the pack.

that appears in Documentation/technical/pack-format.txt still holds
once the pack has been completed at the receiving end.  More
importantly, any object that is mentioned in this table can be
reconstructed by using pack data in the same packfile without
referencing anything else.  In particular, if we were to build a v2
.idx file for the resulting .pack, the list of object names in the
.idx file would be identical to the object names in this table in
the .pack file.
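
The invariant can be stated as a small check.  The following is only
a sketch in Python under simplifying assumptions: object names are
plain strings, "packed_names" stands for the names of the objects
whose data is physically in the completed pack, and the helper names
are hypothetical.

```python
import bisect


def invariant_violations(name_table, packed_names):
    """Return (named_but_absent, present_but_unnamed) as sorted lists.

    Both lists are empty exactly when the table names all objects
    contained in the pack and nothing else.
    """
    table = set(name_table)
    packed = set(packed_names)
    return sorted(table - packed), sorted(packed - table)


def idx_name_list(packed_names):
    """Names a v2-style .idx would list: the packed objects, sorted."""
    return sorted(set(packed_names))


def find_in_table(name_table, name):
    """Binary search in the sorted table; return the position or None."""
    pos = bisect.bisect_left(name_table, name)
    if pos < len(name_table) and name_table[pos] == name:
        return pos
    return None
```

When invariant_violations() returns two empty lists,
idx_name_list(packed_names) equals the table, so a single sorted
table can serve both the .pack and the .idx lookup.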

If that is the case, I too like this.

I briefly wondered if it makes sense to also mention in this table
objects that are often referred to but do not exist in the pack.
For example, new commits included in this pack may refer to a tree
object that has not changed for ages; their trees mention this
subtree using a "SHA-1 reference encoding", and being able to name
the old, unchanging tree with an index to the object table may save
space.  But that would break the above invariant in a big way, as
some objects mentioned in the table may not exist in the packfile
itself, so it probably is not a good idea.  Unlike that broken idea,
the "include names of the objects that will be appended anyway"
approach to help fattening a thin-pack makes very good sense to me.
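
The space argument behind naming an object through the table comes
down to simple arithmetic, sketched below in Python.  The byte costs
are assumptions for illustration, not taken from the format
document: a literal reference is assumed to cost one marker byte plus
the 20-byte name, and a table reference is assumed to be a
7-bits-per-byte variable-length index.

```python
SHA1_LEN = 20  # bytes in a raw SHA-1 object name


def varint_len(value):
    """Bytes needed for a 7-bits-per-byte variable-length integer."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def literal_cost(refs):
    """Assumed cost of spelling the full name at every reference."""
    return refs * (1 + SHA1_LEN)


def table_cost(refs, index):
    """Assumed cost of one table entry plus an index at every reference."""
    return SHA1_LEN + refs * varint_len(index)
```

Under these assumptions, an object at table position 300 that is
referred to once costs about the same either way (22 bytes through
the table against 21 bytes spelled out), while five references cost
30 bytes through the table against 105 bytes spelled out.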
