Re: [PATCH v2 01/10] Define a structure for object IDs.

Duy Nguyen <pclouds@xxxxxxxxx> · Sat, 14 Mar 2015 18:49:51 +0700

On Fri, Mar 13, 2015 at 1:03 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>> To me, <pack, offset> information smells to belong more to a "struct
>>> object" (or its subclass) as an optional annotation---when a caller
>>> is asked to parse_object(), you would bypass the sha1_read_file()
>>> that goes and looks the object name up from the list of pack .idx
>>> and instead go there straight using that annotation.
>>
>> For pack v4, commits and trees can be encoded this way.
>
> Even if your in-pack representation of a commit object allowed to
> store the tree pointer in <pack, nth> format, its object name must
> be computed as if you have the commit object in the traditional
> format and computed the hash of that (together with the standard
> "<type> <size>\0" header), and at that point, you need the contained
> object's name in <sha-1> format (imagine how you would implement the
> commit-tree command).

True. But commit-tree is not often executed as, say, rev-list, where
the hash is used as 20-byte key to some pieces somewhere (pack data,
struct object *). If we keep <pack, nth> in object_id (the union
approach), v4-aware code has the chance to go on a fast path.

> Hence, I do not think the v4 encoding changes
> the discussion very much.  I see the primary value of v4 encoding is
> to shorten the length of various fields take on-disk and in-pack.
> If it were <pack, offset>, it could be argued that it would also be
> faster to get to the packed data in the packfile, and going from
> <pack, nth> to the .idx file and then going to the location in the
> data in the packfile would be faster than going from <sha-1> to a
> particular pack and its in-pack offset, with the difference of cost
> around log(n)*m where n is the number of objects in a pack and m is
> the total number of packs in the repository.
>
> It is true that <nth> format (force that the referred-to object
> lives in the same pack as the referrer) can help speed up
> interpretation of extended SHA-1 expression, e.g. "v1.0^0:t", which
> can read v1.0 tag in v4 format, find the <nth> info for the commit
> pointed by the tag and get to that data in the pack, find the <nth>
> info for the top-tree recorded in that commit and directly get to
> the data of that tree, and then find the entry for "t", which will
> give the object name for that subtree again in <nth> format, and at
> that point you can find the <sha-1> of that final object, without
> having to know any object names of the intermediate objects
> (i.e. you must start from <sha-1> of the tag you obtain from the
> refs API, but you didn't use the object name of the commit and its
> top-level tree).  So for such a codepath, I would say it would be
> sufficient to use a "union in struct" people have been envisioning
> and convert <pack, nth> to <sha-1> when the latter form becomes
> necessary for the first time for the object.

It's the point: faster access (to either pack data, or "struct object
*"). Using a separate union from  struct object_id (iow, object_id
only contains SHA-1) works. But imagine to independent code islands
can perform this type of fast access. In order to pass this union from
one island to another, we need to update pretty much every function
call between them. The change to using object_id has a similar impact.
If we go with "union in object_id", we may be able to avoid the second
mass change to make use of <pack, nth>.

> Anyway, wouldn't this be all academic?  I do not see how you would
> keep the object name in the <pack, nth> format in-core, as the
> obj_hash[] is a hashtable keyed by <sha-1>, and even when we switch
> to a different hash, I cannot see how such a table to ensure the
> singleton-ness of in-core objects can be keyed sometimes by <hash>
> and by <pack, nth> in some other time.

I'm implementing something to see how much we gain by avoiding object
lookup. The current approach is having "struct object ** obj" in
"struct packed_git", indexed by "nth". So when you have <pack, nth>
and pack->obj[nth] is valid, you'll get to "struct object *" without
hashing. If pack->obj[nth] is NULL, hashing and looking up are
required the first time using SHA-1, then the result is cached in
pack->obj[]. So no, it's not academic :)
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html