Re: Thoughts about memory requirements in traversals [Was: Re: [RFC] Submodules in GIT]

Josef Weidendorfer <Josef.Weidendorfer@xxxxxx> wrote:
> Thinking even one step further:
> Would it make sense to define an encoding format for the content of
> commit and tree objects inside of packs, where the SHA1 is replaced by the
> offset of the object in this pack?
> As exactly the SHA1 is the least compressible thing, this could promise
> quite a benefit.

I actually had the same idea the other day.  I discarded it after
thinking about it for a minute.  Here's the problem:

Let's say we do this for the tree and parent IDs in a commit, because
these are the most commonly needed parts of a commit during revision
traversal.  So we want to put the offset to the tree and the offset
to each parent at the front of the commit somehow to make them very
cheap to access.

This means that when we start to write out a commit we need to know
the offset to the tree that commit references.  But git-pack-objects
sorts objects by type: commit, tree, blob (I forget where tags go,
but they aren't important in this context).  So generally *all*
commits appear before the first tree.  So when we write out the first
commit we need to know exactly how many bytes every commit will need
(compressed mind you) in this pack so we can determine the position
of the first tree.  Now do this for every commit and every tree
that those commits use...  yes, it's a lot of work to precompute
and store all offsets before you even write out the first byte.

It's even worse with parent commits because ancestors tend to appear
behind the commit (newest->oldest) so that "git log" can benefit
from OS read-ahead.  So you also have to keep track of your parent
commit offsets.  Not pretty.
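
To make the ordering problem concrete, here is a minimal sketch in
Python.  It assumes a made-up record layout (a small header, then
fixed-width offsets, then the zlib-compressed body); this is not the
real pack format, and the names HEADER, OFFSET and write_pack as well
as the object bodies are placeholders for illustration only.

```python
import struct
import zlib

HEADER = struct.Struct(">BI")  # hypothetical: ref count, compressed body length
OFFSET = struct.Struct(">Q")   # hypothetical: one fixed-width offset per reference


def write_pack(objects):
    """objects: list of (name, refs, body) tuples, already in write order.

    Returns (pack_bytes, offsets), where offsets maps name -> byte position.
    """
    # Pass 1: compress every body up front, because an object's offset
    # depends on the compressed size of everything written before it.
    compressed = [zlib.compress(body) for _, _, body in objects]

    offsets = {}
    position = 0
    for (name, refs, _), data in zip(objects, compressed):
        offsets[name] = position
        position += HEADER.size + OFFSET.size * len(refs) + len(data)

    # Pass 2: only now can the first record be emitted, since the first
    # commit points forward at its tree and at its parent.
    out = bytearray()
    for (_, refs, _), data in zip(objects, compressed):
        out += HEADER.pack(len(refs), len(data))
        for ref in refs:
            out += OFFSET.pack(offsets[ref])
        out += data
    return bytes(out), offsets


# Write order as described above: commits newest to oldest, then trees.
objects = [
    ("commit2", ["tree2", "commit1"], b"author A\n\nsecond commit\n"),
    ("commit1", ["tree1"], b"author A\n\nfirst commit\n"),
    ("tree2", [], b"100644 README\n100644 main.c\n"),
    ("tree1", [], b"100644 README\n"),
]
pack, offsets = write_pack(objects)
```

Every reference in this toy pack is a forward reference, so nothing
can be written until all objects have been compressed and laid out.
The sketch only avoids a circular dependency because the offsets sit
outside the compressed data and have a fixed width.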

Extending that idea to tree objects (store the offset of the entry)
makes the issue even uglier.

Oh, and packs aren't entirely self-contained.  A pack is only self
contained in the sense that no object in the pack deltifies against
an object outside of the pack[1].  However by design an object
(e.g. a commit or a tree) can reference an object which is either
loose or which is in another pack.  This is especially important
for very large projects where not every commit/tree/tag/blob will
fit into one 4 GiB file.
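
As a rough illustration of what that implies, an offset-based
encoding would still need a fallback for objects that are not in the
same pack.  The sketch below assumes a hypothetical tagged reference
(b"O" plus an 8-byte offset, or b"S" plus the 20-byte SHA-1); neither
the tags nor encode_ref exist in git.

```python
import hashlib


def encode_ref(sha1, pack_offsets):
    """Encode a reference to the object named by the 20-byte `sha1`.

    pack_offsets maps SHA-1 -> offset for objects stored in this pack.
    """
    offset = pack_offsets.get(sha1)
    if offset is not None:
        # Target is in this pack: the offset can stand in for the SHA-1.
        return b"O" + offset.to_bytes(8, "big")
    # Target is loose or in another pack: the full SHA-1 is still needed.
    return b"S" + sha1


in_pack = hashlib.sha1(b"tree stored in this pack").digest()
elsewhere = hashlib.sha1(b"blob that is loose or in another pack").digest()
pack_offsets = {in_pack: 4096}

local_ref = encode_ref(in_pack, pack_offsets)
remote_ref = encode_ref(elsewhere, pack_offsets)
```

So every reader would have to handle both forms, and the saving only
applies to references that happen to stay inside one pack.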

[1] Except in the case of thin packs, which are used only on the
network and only to save bandwidth.

> AFAIK, we currently only use these offsets for referencing objects in
> delta chains.

Yes, that's a recent feature to reference a delta base.

-- 
Shawn.