On (re)packing

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 04 Feb 2013 22:35:48 -0800

Just a random thought...

"git pack-objects" runs "rev-list --objects" to obtain three
different kinds of information about objects:

 * Reachability.  Objects that appear in the output of rev-list are
   relevant, and those that do not are not;

 * Recency, aka closeness in the time dimension.  Objects that
   appear in the output of rev-list close together appear in the
   history nearby (a commit and its parents, a commit and its tree,
   a tree and its subtree and blob);

 * Paths, aka closeness in the space dimension.  Trees and blobs are
   labelled with paths they are first discovered at.

All of these are necessary for reasonable packing.  We choose only
the reachable objects to pack.  We place objects close in the time
dimension together in the output pack stream.  And we delta objects
close in the space dimension against each other.

Now, we are planning to use object reachability bitmap to speed up
the object enumeration phase of repacking.  This primarily gives us
the reachability information.  Placing objects that are packed close
in existing packs close to each other in the output stream hopefully
may give us good enough approximation of the recency order.  And
reusing the delta will keep good existing delta if both deltified
object and its delta base object are included in the output.

What should happen when we repack two or more packs?  Placing an
object that appears near the beginning of one pack close to another
object that appears near the beginning of the other pack would
probably not make much sense, so the natural extension of emulating
recency order by pack offset would be to place all objects from one
pack together (in their original in-pack offset orde) and then
objects from the other pack together.

More problematic is how to coalesce one delta chain from one pack
with another delta chain from the other pack.  Without having the
path information, we cannot efficiently do this.  Especially if we
are to repack 50 small packs into one, it would be desirable if we
can avoid packing 50 similar objects in their undeltified form.

Assuming that we would want some extra information, in addition to
the bitmaps, to help repacking, and also with the recent "commit
metadata" topic in mind, lets do a back-of-the-envelope enumeration
of what we would want.  Here is a list off the top of my head.

* commits

  99.9% of all commits have no more than two parents.  Assuming that
  we keep this information per-pack, and we will not be packing more
  than 2^32 objects in a pack, we can represent up to two parent
  commits by two int32s to represent their position in the table of
  SHA-1's in the .idx file.  Also we can store another uint32 for
  the commit timestamp (compute the timestamp of the oldest commit
  in the pack and store it as uint64, and represent the commmit
  timestamp of each commit with uint32 (we can represent a pack that
  spans 68 years with this scheme).  Give up storing metainformation
  for any commit whose information cannot fit this scheme (e.g. an
  octopus, or a commit with timestamp way out of line).

  That costs 12 bytes per commit (plus 8-byte for the oldest
  timestamp in the pack).

* tags

  The timestamp can be expressed with a uint32 in the same way as
  commit's timestamp, and the tagged object can be expressed with a
  uint32 in the same way as commit's parent.  Two uint32s

* trees and blobs

  The name hash value is uint32.

If we have other information that are needed for trees and blobs
that fit in two more uint32s, then we can just build a table of
12-byte entries for all objects in the pack.  

Otherwise, we would need a way to have separate table that can be
quickly accessed, going from an object name to either 3 uint32 tuple
(if it is a commit), 2 uint32 tuple (if it is a tag), or a uint32
(others).

Whether we can do a table with uniform entry size or split tables,
once we locate an entry, we can find out the object number in the
SHA-1 table the object refers to (e.g. parents, or tagged) so I do
not think it is a good idea to express object pointer with an
abbreviated SHA-1.  But if we have to build commit-only table, we
may need to use a table with 16-byte entry size, that has 8
hexdigits (i.e. 4-bytes) abbreviated object name of a commit,
followed by the 12 byte commit metainformation, and sort it by the
abbreviated object name.  With the same scheme, another table for
trees and blobs will cost 4-bytes for abbreviated object name as key
and 4-bytes for name-hash payload.  That sounds a bit too wasteful
to my taste.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html