Re: [PATCH v2 01/10] Define a structure for object IDs.

Duy Nguyen <pclouds@xxxxxxxxx> · Fri, 13 Mar 2015 07:58:20 +0700

On Fri, Mar 13, 2015 at 1:24 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Duy Nguyen <pclouds@xxxxxxxxx> writes:
>
>> This may or may not fall into the "mix different hash functions"
>> category. In pack files version 4, trees are encoded to point to other
>> trees or blobs by a (pack, offset) tuple. It would be great if the new
>> object_id could support carrying this kind of object id around because
>> it could help reduce object lookup cost a lot. (pack, offset) can be
>> converted back to SHA-1 so no info is lost and hashcmp() can compare
>> (pack, tuple) against an SHA-1 just fine.
>
> You mean "if it came in <pack, offset> format, convert it down to
> <sha1> until the last second that it is needed (e.g. need to put
> that in a tree object in order to compute the object name of the
> containing tree object)"?

I picked my words poorly. It should be <pack, the index in pack>
instead of the _byte_ offset. This index ranges from 0 to the number
of objects minus one. This is essentially what pack v4 stores in
places where pack v2 would store SHA-1, which makes me make the
connection: both are different ways of identifying an object.

> After converting an object name originally represented as <pack,
> offset>, if we are doing the "union in struct" thing, to <sha1>
> representation, you would have to look it up from .idx in order to
> read the contents the usual way.  If that happens often enough, then
> it may not be worth adding complexity to the code to carry the
> <pack, offset> pair around.

I'd keep it in <pack, index> for as long as possible (or until higher
layer decides that it's not worth keeping in this representation
anymore). I can only see two use cases where actual SHA-1 is involved:
sorting by SHA-1 (rarely done, probably only in index-pack and
pack-objects) and output to the screen (or producing v2 packs).

> Unless you fix that "union in struct" assumption, that is.
>
> To me, <pack, offset> information smells to belong more to a "struct
> object" (or its subclass) as an optional annotation---when a caller
> is asked to parse_object(), you would bypass the sha1_read_file()
> that goes and looks the object name up from the list of pack .idx
> and instead go there straight using that annotation.

For pack v4, commits and trees can be encoded this way. parse_object()
could help the commit case (maybe, maybe not, I haven't looked at
commit walkers), not recursive tree walking where we now pass SHA-1
around to user callback and all. Carrying optional annotation around
would impact the code in many places.  Also, storing this info in
struct object seems conflict with v4 goal of reducing "struct object"
lookup cost.  Maybe I'm missing something here.

Then again, we don't know if pack v4 will get merged in the end (I do
hope it will). And we have an option of making specialized commit/tree
walkers that are aware of pack v4 and only use them in hot places to
reduce impact to the rest of the code base. If hash[GIT_MAX_RAWSZ]
looks like a good enough solution, we can go with that and worry about
pack v4 later when/if it comes.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html