Re: Thoughts about memory requirements in traversals [Was: Re: [RFC] Submodules in GIT]

Josef Weidendorfer wrote:

> On Sunday 03 December 2006 03:46, Shawn Pearce wrote:
>> Josef Weidendorfer <Josef.Weidendorfer@xxxxxx> wrote:
>>> Thinking even one step further:
>>> Would it make sense to define an encoding format for the content of
>>> commit and tree objects inside of packs, where the SHA1 is replaced by
>>> the offset of the object in this pack?
>>> As exactly the SHA1 is the least compressible thing, this could promise
>>> quite a benefit.
>> [...]
>> 
>> This means that when we start to write out a commit we need to know
>> the offset to the tree that commit references.  But git-pack-objects
>> sorts objects by type: commit, tree, blob (I forget where tags go,
>> but they aren't important in this context).  So generally *all*
>> commits appear before the first tree.  So when we write out the first
>> commit we need to know exactly how many bytes every commit will need
>> (compressed mind you) in this pack so we can determine the position
>> of the first tree.  Now do this for every commit and every tree
>> that those commits use...  yes, it's a lot of work to precompute
>> and store all offsets before you even write out the first byte.
> 
> Yes, it looks like a chicken-and-egg problem, but IMHO you can
> handle it nicely with another redirection, i.e. a table you build
> up while repacking the file, and storing this table at the end.
> 
> You simply sequentially renumber any object SHA, starting from 0
> in the order you see them. You can do two renumberings, one for
> the objects contained in the original pack (1), and one for the
> external ones (2). Put these new numbers (with a bit distinguishing
> (1) and (2)) as replacement into commit/tree objects.
> At the end, you have the new offsets for objects in (1). Put
> redirection tables for (1) [new number -> new offset]
> and (2) [other new number->SHA1 of external object] at the end
> of the new pack.
> This way, you effectively have removed all incompressible SHAs from
> the pack file aside from one entry in the redirection tables for
> each external object.
> 
> The only problem I see is how to decode the objects, i.e. how to
> get the original SHA1 from an offset: we cannot recalculate the
> SHA1 from the object content as we changed the content itself.
> But there should be a way to store the SHA1 in front of the object
> somehow, perhaps it is already given by the current format? 
> 
> Am I missing something here?

Doesn't this idea clash with object and delta reuse during repack?
Hmmm... perhaps with the two indirection tables it wouldn't, since only
the tables would need to be recalculated... or perhaps it would, because
of offset clashes.
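
To make the quoted renumbering proposal concrete, here is a rough sketch
of how I read it. This is only an illustration, not how git-pack-objects
works: the function name, the dict-based "objects", and the fixed sizes
are all made up for the example. The point it tries to show is that the
small numbers can be handed out while objects are being written, and the
real offsets only need to go into table (1) once everything is written.

```python
# Hypothetical sketch of the two-table renumbering idea from the quoted mail.
# Nothing here is a real git API; ids are plain strings standing in for SHA1s.

INTERNAL, EXTERNAL = 0, 1  # the "bit distinguishing (1) and (2)"


def renumber(write_order, refs, sizes):
    """Replace referenced ids by small numbers and build both tables.

    write_order: ids of the objects in this pack, in the order written
    refs:        id -> list of ids mentioned in that object's content
    sizes:       id -> size in bytes once written (assumed known per object)
    """
    in_pack = set(write_order)
    internal = {}  # id -> new number, for objects inside this pack
    external = {}  # id -> new number, for objects outside this pack
    encoded = {}   # id -> list of (bit, number) replacing the raw ids

    for obj in write_order:
        out = []
        for target in refs.get(obj, []):
            if target in in_pack:
                # Number ids sequentially from 0, in the order first seen.
                number = internal.setdefault(target, len(internal))
                out.append((INTERNAL, number))
            else:
                number = external.setdefault(target, len(external))
                out.append((EXTERNAL, number))
        encoded[obj] = out

    # Offsets are only known after the objects have been laid out.
    offsets = {}
    position = 0
    for obj in write_order:
        offsets[obj] = position
        position += sizes[obj]

    # Table (1): new number -> new offset in this pack.
    internal_table = [None] * len(internal)
    for oid, number in internal.items():
        internal_table[number] = offsets[oid]

    # Table (2): other new number -> id (SHA1) of the external object.
    external_table = [None] * len(external)
    for oid, number in external.items():
        external_table[number] = oid

    return encoded, internal_table, external_table


# Two commits and two trees, commits written first as in the quoted mail.
order = ["c1", "c2", "t1", "t2"]
refs = {"c1": ["t1", "p0"], "c2": ["t2", "c1"], "t1": ["b_ext"], "t2": []}
sizes = {"c1": 100, "c2": 120, "t1": 50, "t2": 60}
encoded, table1, table2 = renumber(order, refs, sizes)
```

Note that the sketch does nothing about the decoding question raised in
the quoted mail (recovering the original SHA1 of a rewritten object), and
it does not model delta reuse at all.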

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

