On 3/28/07, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I just started experimenting with using git ...
> > Part of a checkout is about 55GB;
> > after an initial commit and packing I have a 20GB+ packfile.
> > Of course this is unusable, ... . I conclude that
> > for such large projects, git-repack/git-pack-objects would need
> > new options to control maximum packfile size.
>
> Either that, or update the index file format. I think that your
> approach of having a size limiter is actually the *better* one, though.
>
> > [ I don't think this affects git-{fetch,receive,send}-pack
> > since apparently only the pack is transferred and it only uses
> > the variable-length size and delta base offset encodings
> > (of course the accumulation of the 7 bit chunks in a 32b
> > variable would need to be corrected, but at least the data
> > format doesn't change).]
>
> Well, it does affect fetching, in that "git index-pack" obviously
> would also need to be taught how to split the resulting indexed packs
> up into multiple smaller ones from one large incoming one.
>
> But that shouldn't be fundamentally hard either, apart from the
> inconvenience of having to rewrite the object count in the pack
> headers..
>
> To avoid that issue, it may be that it's actually better to split
> things up at pack-generation time *even* for the case of --stdout,
> exactly so that "git index-pack" wouldn't have to split things up (we
> potentially know a lot more about object sizes up-front at
> pack-generation time than we know at re-indexing).
The attached patch adds a --pack-limit[=N] option to
git-repack/git-pack-objects. N defaults to 1<<31, and the result with
--pack-limit is that no packfile can be equal to or larger than N. A
--blob-limit=N option is also added (see below).

My original plan was simply to ensure that no object started at a file
offset not representable in 31 bits. However, I became concerned about
the arithmetic involved when mmap'ing a pack, so I decided to make sure
*all* bytes lived at offsets representable in 31 bits. Consequently,
after an object is written out, the new offset is checked. If the
limit has been exceeded, the write is rolled back (see
sha1mark/sha1undo). This is awkward and inefficient, but it yields
packs closer to the limit and happens too infrequently to have much
impact.

However, there are really two modes when packing: packing to disk, and
packing to stdout. Since you can't roll back a write on stdout, the
initial file-offset-limit technique is used when --stdout is specified.
[Note: I did not *test* the --pack-limit && --stdout combination.]

To fully guarantee that a pack file doesn't exceed a certain size,
objects above that size must not be packed into it. But I think this
makes sense -- I don't see much advantage to packing a 100MB+ object
into a pack, except for fetch/send, which is a serial stream without an
index anyway. Thus this patch automatically excludes any object whose
uncompressed size is 1/4 or more of the packfile size limit when
--stdout is not specified. This behavior can be altered with an
explicit --blob-limit=N option.

Two interesting twists presented themselves. First, the packfile
contains the number of objects in the header at the beginning, and this
header is included in the final SHA1. But I don't know the final count
until the limit is reached. Consequently the header must be rewritten
and the entire file rescanned to compute the correct checksum. This
already happens in two other places in git.
Secondly, when using --pack-limit with --stdout, the header can't be
rewritten. Instead the object count in the header is left at 0 to flag
that it's wrong. The end of an individual pack inside a multi-pack
stream COULD then be detected by checking, after each object, whether
the next 20 bytes are equal to the SHA1 of everything that's come
before. I've made no additional effort beyond this minimal solution
because it's not clear that splitting a pack up at the transmitter is
better than at the receiver. An alternative method is to add, before
the final SHA1, a last object of type OBJ_NONE and length 0 (thus a
single zero byte). This would function as an EOF marker. I've
indicated where this would go in write_pack_file but didn't put it in,
since the current code doesn't tolerate a 0 object count in the header
anyway (yet?).

[Note: I have *not* started in on teaching git-index-pack etc. how to
read such concatenated split packs, since (a) I'd like to see which way
people prefer and (b) I don't plan on using the feature anyway -- and
I'm wondering if I'm alone in that reaction.]

Some code has been added, but very few function relationships have
changed, with one exception: write_pack_file now calls write_index_file
directly, since write_pack_file decides when to split packs and thus
must call write_index_file before moving on to the next pack.

In response to my original post, I've seen some emails about changing
the pack file/index file format. This is exactly what I *didn't* want
to do, since (1) it would delay a feature I'd like to use now, (2) the
current format is better than people seem to realize, and (3) it would
create yet another flag in the config file to help phase in a new
feature over a year or two. If, however, there are other pent-up
reasons for changing the format which might make it happen sometime
soon, I can see some small tweaks that could be useful:
* [For stdout/serial access:] Tolerate "0" for the object count in a
  .pack file; it would mean look for the pack end by either matching a
  SHA1 or looking for an OBJ_NONE/0 record, all as explained above.
  (The point is to avoid any need to rescan a file to rebuild
  checksums.)

* [For disk/random access:] Don't change the current .pack/.idx files,
  but do add a third file type which would be a "super index" with a
  format similar to .idx. It would map sorted SHA1s to (pack#,offset)
  pairs, either in one table of triples or in two parallel tables, one
  of SHA1s and the other of pairs. It probably would only be used if
  mentioned in objects/info/packs (and it would be automatically
  ignored if older than objects/info/packs?). It could be searched by
  taking advantage of the uniform SHA1 distribution recently discussed.
  There would be at most one such file in a repository; perhaps the
  .idx files from which it was generated could be removed. For safety
  the "super index" could contain a small table of all the SHA1s for
  the packs it indexes.

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell

cat GIT-VERSION-FILE
GIT_VERSION = 1.5.1.rc2.18.g9c88-dirty
Attachment:
large.patch
Description: Binary data