Just a random thought... "git pack-objects" runs "rev-list --objects" to obtain three different kinds of information about objects: * Reachability. Objects that appear in the output of rev-list are relevant, and those that do not are not; * Recency, aka closeness in the time dimension. Objects that appear in the output of rev-list close together appear in the history nearby (a commit and its parents, a commit and its tree, a tree and its subtree and blob); * Paths, aka closeness in the space dimension. Trees and blobs are labelled with paths they are first discovered at. All of these are necessary for reasonable packing. We choose only the reachable objects to pack. We place objects close in the time dimension together in the output pack stream. And we delta objects close in the space dimension against each other. Now, we are planning to use object reachability bitmap to speed up the object enumeration phase of repacking. This primarily gives us the reachability information. Placing objects that are packed close in existing packs close to each other in the output stream hopefully may give us good enough approximation of the recency order. And reusing the delta will keep good existing delta if both deltified object and its delta base object are included in the output. What should happen when we repack two or more packs? Placing an object that appears near the beginning of one pack close to another object that appears near the beginning of the other pack would probably not make much sense, so the natural extension of emulating recency order by pack offset would be to place all objects from one pack together (in their original in-pack offset orde) and then objects from the other pack together. More problematic is how to coalesce one delta chain from one pack with another delta chain from the other pack. Without having the path information, we cannot efficiently do this. Especially if we are to repack 50 small packs into one, it would be desirable if we can avoid packing 50 similar objects in their undeltified form. Assuming that we would want some extra information, in addition to the bitmaps, to help repacking, and also with the recent "commit metadata" topic in mind, lets do a back-of-the-envelope enumeration of what we would want. Here is a list off the top of my head. * commits 99.9% of all commits have no more than two parents. Assuming that we keep this information per-pack, and we will not be packing more than 2^32 objects in a pack, we can represent up to two parent commits by two int32s to represent their position in the table of SHA-1's in the .idx file. Also we can store another uint32 for the commit timestamp (compute the timestamp of the oldest commit in the pack and store it as uint64, and represent the commmit timestamp of each commit with uint32 (we can represent a pack that spans 68 years with this scheme). Give up storing metainformation for any commit whose information cannot fit this scheme (e.g. an octopus, or a commit with timestamp way out of line). That costs 12 bytes per commit (plus 8-byte for the oldest timestamp in the pack). * tags The timestamp can be expressed with a uint32 in the same way as commit's timestamp, and the tagged object can be expressed with a uint32 in the same way as commit's parent. Two uint32s * trees and blobs The name hash value is uint32. If we have other information that are needed for trees and blobs that fit in two more uint32s, then we can just build a table of 12-byte entries for all objects in the pack. Otherwise, we would need a way to have separate table that can be quickly accessed, going from an object name to either 3 uint32 tuple (if it is a commit), 2 uint32 tuple (if it is a tag), or a uint32 (others). Whether we can do a table with uniform entry size or split tables, once we locate an entry, we can find out the object number in the SHA-1 table the object refers to (e.g. parents, or tagged) so I do not think it is a good idea to express object pointer with an abbreviated SHA-1. But if we have to build commit-only table, we may need to use a table with 16-byte entry size, that has 8 hexdigits (i.e. 4-bytes) abbreviated object name of a commit, followed by the 12 byte commit metainformation, and sort it by the abbreviated object name. With the same scheme, another table for trees and blobs will cost 4-bytes for abbreviated object name as key and 4-bytes for name-hash payload. That sounds a bit too wasteful to my taste. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html