On 3/28/07, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I just started experimenting with using git ...
> > Part of a checkout is about 55GB;
> > after an initial commit and packing I have a 20GB+ packfile.
> > Of course this is unusable, ... . I conclude that
> > for such large projects, git-repack/git-pack-objects would need
> > new options to control maximum packfile size.
>
> Either that, or update the index file format. I think that your
> approach of having a size limiter is actually the *better* one, though.
>
> > [ I don't think this affects git-{fetch,receive,send}-pack
> > since apparently only the pack is transferred and it only uses
> > the variable-length size and delta base offset encodings
> > (of course the accumulation of the 7 bit chunks in a 32b
> > variable would need to be corrected, but at least the data
> > format doesn't change).]
>
> Well, it does affect fetching, in that "git index-pack" obviously
> would also need to be taught how to split the resulting indexed packs
> up into multiple smaller ones from one large incoming one.
>
> But that shouldn't be fundamentally hard either, apart from the
> inconvenience of having to rewrite the object count in the pack
> headers..
>
> To avoid that issue, it may be that it's actually better to split
> things up at pack-generation time *even* for the case of --stdout,
> exactly so that "git index-pack" wouldn't have to split things up (we
> potentially know a lot more about object sizes up-front at
> pack-generation time than we know at re-indexing).
The attached patch adds a --pack-limit[=N] option to
git-repack/git-pack-objects. N defaults to 1<<31, and the result with
--pack-limit is that no packfile can be equal to or larger than N. A
--blob-limit=N option is also added (see below).

My original plan was simply to ensure that no object started at a file
offset not representable in 31 bits. However, I became concerned about
the arithmetic involved when mmap'ing a pack, so I decided to make sure
*all* bytes lived at offsets representable in 31 bits. Consequently,
after an object is written out, the new offset is checked. If the
limit has been exceeded, the write is rolled back (see
sha1mark/sha1undo). This is awkward and inefficient, but it yields
packs closer to the limit and happens too infrequently to have much
impact.

However, there are really two modes when packing: packing to disk, and
packing to stdout. Since you can't roll back a write on stdout, the
initial file-offset-limit technique is used when --stdout is specified.
[Note: I did not *test* the --pack-limit && --stdout combination.]

To fully guarantee that a pack file doesn't exceed a certain size,
objects above that size must not be packed into it. But I think this
makes sense -- I don't see much advantage to packing a 100MB+ object
into a pack, except for fetch/send, which is a serial stream without an
index anyway. Thus this patch automatically excludes any object whose
uncompressed size is 1/4 or more of the packfile size limit when
--stdout is not specified. This behavior can be altered with an
explicit --blob-limit=N option.

Two interesting twists presented themselves. First, the packfile
contains the number of objects in the header at the beginning, and this
header is included in the final SHA1. But I don't know the final count
until the limit is reached. Consequently the header must be rewritten
and the entire file rescanned to compute the correct checksum. This
already happens in two other places in git.
Secondly, when using --pack-limit with --stdout, the header can't be
rewritten. Instead the object count in the header is left at 0 to flag
that it's wrong. The end of an individual pack inside a multi-pack
stream COULD then be detected by checking, after each object, whether
the next 20 bytes are equal to the SHA1 of everything that's come
before. I've made no additional effort beyond this minimal solution
because it's not clear that splitting a pack up at the transmitter is
better than at the receiver. An alternative method is to add, before
the final SHA1, a last object of type OBJ_NONE and length 0 (thus a
single zero byte). This would function as an EOF marker. I've
indicated where this would go in write_pack_file but didn't put it in,
since the current code doesn't tolerate a 0 object count in the header
anyway (yet?).

[Note: I have *not* started in on teaching git-index-pack etc. how to
read such concatenated split packs, since (a) I'd like to see which way
people prefer and (b) I don't plan on using the feature anyway -- and
I'm wondering if I'm alone in that reaction.]

Some code has been added, but very few function relationships have
changed, with one exception: write_pack_file now calls write_index_file
directly, since write_pack_file decides when to split packs and thus
must call write_index_file before moving on to the next pack.

In response to my original post, I've seen some emails about changing
the pack file/index file format. This is exactly what I *didn't* want
to do, since (1) it would delay a feature I'd like to use now, (2) the
current format is better than people seem to realize, and (3) it would
create yet another flag in the config file to help phase in a new
feature over a year or two. If, however, there are other pent-up
reasons for changing the format which might make it happen sometime
soon, I can see some small tweaks that could be useful:
* [For stdout/serial access:] Tolerate "0" for the object count in a
  .pack file; it would mean look for the pack end by either matching a
  SHA1 or looking for an OBJ_NONE/0 record, all as explained above.
  (The point is to avoid any need to rescan a file to rebuild
  checksums.)

* [For disk/random access:] Don't change the current .pack/.idx files,
  but do add a third file type which would be a "super index" with a
  format similar to .idx. It would map sorted SHA1s to (pack#,offset)
  pairs, either in one table of triples or in two parallel tables, one
  of SHA1s and the other of pairs. It probably would only be used if
  mentioned in objects/info/packs (and it would be automatically
  ignored if older than objects/info/packs?). It could be searched by
  taking advantage of the uniform SHA1 distribution recently discussed.
  There would be at most one such file in a repository; perhaps the
  .idx files from which it was generated could be removed. For safety
  the "super index" could contain a small table of all the SHA1s for
  the packs it indexes.

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell

cat GIT-VERSION-FILE
GIT_VERSION = 1.5.1.rc2.18.g9c88-dirty
Attachment:
large.patch
Description: Binary data