On Fri, 13 Feb 2015, Duy Nguyen wrote: > After taking 1.5 years "vacation" from pack v4, I plan to do something > about it again. Will post more when I have some patches to discuss. > Only one question for now (forgive me if I asked already, it's been > quite some time) Yeah. I had to re-study my own code before replying. > I think pack v4 does not deliver its best promise that walking a tree > is simply following pointers and jumping from place to place. When we > want to copy from the middle of another tree, we need to scan from the > beginning of the tree. Tree offset cache helps, but the problem > remains. What do you think about an alternative format that each > "copy" instruction includes both index of the tree entry to copy from > (i.e. what we store now) _and_ the byte offset from the beginning of > the tree? With this byte offset, we know exactly where to start > copying without scanning from the beginning. It will be a bit(?) > bigger, but it's also faster. That would make the format inflexible. If we want to do partial repacking by, say, copying some objects and repacking others (some objects might need repacking because the objects they refer to are omitted from the repack) then if those repacked objects are themselves referred to by byte offset then we lose as the offset is no longer valid. > I imagine this is an optimization that can be done locally. The pack > transferred over network does not have these byte offsets. After the > pack is stored and verified by index-pack, we can rewrite it and add > this info. The simplest way is use a fixed size for this offset (e.g. > uint16_t or even uint8_t), add the place holder in copy instructions > of all v4 trees. After that object offsets will not change again and > we can start filling real offsets to placeholders. Having a local extra index is fine. Just like the pack index which is always created locally and can be recreated at any time. Some tree offset cache might be beneficial, but I'd avoid making it into the pack file itself. Yet, I think the biggest problem with pack v4 at the moment is the packing algorithm for tree objects. We are piggy-backing on the pack v2 object delta compression sorting and that produces suboptimal results due to deep recursions. And it is the recursion that kills us. The pack v4 requires a new packing algorithm for its tree objects. What I imagined is something like this: - Each canonical tree entry is made of a SHA1, mode and path. Let's assume this is hashed into a 24-bit value. - Each tree object can therefore be represented as a string of 24-bit "characters". - Delta-compressing a tree object becomes a substring search where we try to replace a sequence of "characters" with the longest "string" possible from another object. Repeat with the remaining sequences. Having a 24-bit hash value is totally arbitrary. It could be 16 bits with more collisions but much faster search and less memory usage. The optimal value would need to be determined after some experimentation. Algorithms for the longest common substring problem already exist. So one of the classical algorithms could probably be adapted here. This would allow for exploiting the provision in pack v4 to copy from more than one tree object. And this would also favor shallower recursions and even smaller packs. Imposing a minimum substring length (rather than a maximum delta depth) would determine the runtime performance when using the pack afterwards. If you have enough free cycles to work on this, that's what I'd suggest you explore at this point. I wish I could myself as I think this ought to be rather cool work. Nicolas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html