Josef Weidendorfer <Josef.Weidendorfer@xxxxxx> wrote:
> Thinking even one step further:
> Would it make sense to define an encoding format for the content of
> commit and tree objects inside of packs, where the SHA1 is replaced by the
> offset of the object in this pack?
> As exactly the SHA1 is the least compressable thing, this could promise
> quite a benefit.

I actually had the same idea the other day. I discarded it after thinking about it for a minute. Here's the problem:

Let's say we do this for the tree and parent IDs in a commit, because these are the most commonly needed parts of a commit during revision traversal. So we want to put the offset to the tree and the offset to each parent at the front of the commit somehow to make them very cheap to access.

This means that when we start to write out a commit we need to know the offset to the tree that commit references. But git-pack-objects sorts objects by type: commit, tree, blob (I forget where tags go, but they aren't important in this context). So generally *all* commits appear before the first tree.

So when we write out the first commit we need to know exactly how many bytes every commit will need (compressed, mind you) in this pack so we can determine the position of the first tree. Now do this for every commit and every tree that those commits use... yes, it's a lot of work to precompute and store all offsets before you even write out the first byte.

It's even worse with parent commits because ancestors tend to appear behind the commit (newest->oldest) so that "git log" can benefit from OS read-ahead. So you also have to keep track of your parent commit offsets. Not pretty.

Extending that idea to tree objects (store the offset of the entry) makes the issue even uglier.

Oh, and packs aren't entirely self-contained. A pack is only self-contained in the sense that no object in the pack deltifies against an object outside of the pack[1]. However, by design an object (e.g. a commit or a tree) can reference an object which is either loose or which is in another pack. This is especially important for very large projects where not every commit/tree/tag/blob will fit into one 4 GiB file.

[1] Except in the case of thin packs, which are used only on the network and only to save bandwidth.

> AFAIK, we currently only use these offsets for referencing objects in
> delta chains.

Yes, that's a recent feature to reference a delta base.

-- 
Shawn.
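
As a rough illustration of the ordering problem described above, here is a minimal Python sketch. It is not how git-pack-objects is implemented. The object names, the payloads, and the helpers `compressed_size` and `pack_offsets` are all hypothetical, and the sketch ignores the per-object header that a real pack entry carries. It only shows that, with commits written before trees, the offset of the first tree depends on the compressed size of every commit.

```python
import zlib

# A pack starts with a 12-byte header: "PACK", version, object count.
PACK_HEADER_SIZE = 12


def compressed_size(payload: bytes) -> int:
    """Size of the zlib-compressed payload (per-object headers ignored)."""
    return len(zlib.compress(payload))


def pack_offsets(objects):
    """Return {name: offset} for objects laid out back to back in list order."""
    offsets = {}
    position = PACK_HEADER_SIZE
    for name, payload in objects:
        offsets[name] = position
        position += compressed_size(payload)
    return offsets


# Hypothetical pack contents in type order: all commits (newest first),
# then all trees. Names stand in for SHA-1 IDs.
objects = [
    ("commit-3", b"tree tree-3\nparent commit-2\n\nthird\n"),
    ("commit-2", b"tree tree-2\nparent commit-1\n\nsecond\n"),
    ("commit-1", b"tree tree-1\n\nfirst\n"),
    ("tree-3", b"100644 a.txt\n100644 b.txt\n100644 c.txt\n"),
    ("tree-2", b"100644 a.txt\n100644 b.txt\n"),
    ("tree-1", b"100644 a.txt\n"),
]

offsets = pack_offsets(objects)

# The first tree's offset is only known once every commit has been sized.
commit_bytes = sum(
    compressed_size(payload)
    for name, payload in objects
    if name.startswith("commit")
)
first_tree_offset = PACK_HEADER_SIZE + commit_bytes

for name, _ in objects:
    print(f"{name:10s} offset {offsets[name]}")
```

Note that the sketch sizes the commits while they still contain names. If the names were replaced by offsets, the bytes of each commit (and so its compressed size) would change once the offsets were filled in, so a real writer would have to settle every offset before emitting the first commit.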