On Wed, 28 Mar 2007, Dana How wrote:
>
> I just started experimenting with using git on a large engineering
> project which has used p4 so far. Part of a checkout is about 55GB;
> after an initial commit and packing I have a 20GB+ packfile.

Oh wow. You don't do half measures, do you ;)

> Of course this is unusable, since object_entry's in an .idx file have
> only 32 bits in their offset fields. I conclude that for such large
> projects, git-repack/git-pack-objects would need new options to
> control maximum packfile size.

Either that, or update the index file format. I think that your approach
of having a size limiter is actually the *better* one, though.

> [ I don't think this affects git-{fetch,receive,send}-pack, since
> apparently only the pack is transferred and it only uses the
> variable-length size and delta base offset encodings (of course the
> accumulation of the 7-bit chunks in a 32-bit variable would need to be
> corrected, but at least the data format doesn't change). ]

Well, it does affect fetching, in that "git index-pack" obviously would
also need to be taught how to split the resulting indexed packs up into
multiple smaller ones from one large incoming one. But that shouldn't be
fundamentally hard either, apart from the inconvenience of having to
rewrite the object count in the pack headers.

To avoid that issue, it may be that it's actually better to split things
up at pack-generation time *even* for the case of --stdout, exactly so
that "git index-pack" wouldn't have to split things up (we potentially
know a lot more about object sizes up-front at pack-generation time than
we do at re-indexing).

> So I am toying with adding a --limit <size> flag to
> git-repack/git-pack-objects.

Sounds very sane.

> This cannot be used with --stdout. If specified, e.g.
>     git-repack --limit 2g
> then each packfile created could be at most 2^31-1 bytes in size.

Sounds good, apart from the caveat above about "--stdout" that needs some
thinking about.

> It's possible that multiple packfiles would be created in one shot.
> Thus git-pack-objects could write multiple names to stdout and
> git-repack would need to be updated accordingly.

Yes. That seems to be the least of all problems.

> Finally, I wonder if having tree/commit/tag objects mixed into such
> large packfiles would be a performance hit.

My initial reaction is that it's best to start off without worrying
about that, and just do everything in the order that we do now (ie "sort
by type first, recency second") and just split when we hit the size
limit.

But if you actually want to experiment with different organizations, I
don't think that's wrong. I would just personally start without it.

		Linus
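
For reference on the 32-bit limit Dana is hitting: each entry in a
version-1 .idx file, after the 256-entry fan-out table, is a 4-byte
network-order pack offset followed by the 20-byte object name, so no
offset past 4GiB can be represented. A rough illustration of the on-disk
layout (not the struct git actually uses to read it):

#include <stdint.h>

/*
 * Illustration of one entry in a version-1 pack index.  The 32-bit
 * big-endian offset is what caps an indexable pack at 4GiB.
 */
struct idx_v1_entry {
	uint32_t      offset;   /* offset into the .pack, network byte order */
	unsigned char sha1[20]; /* object name */
};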
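
On the 7-bit-chunk point: the size field in each pack object header is an
unbounded base-128 encoding, so only the reader has to change. A sketch of
the decoding (modeled on, but not copied from, git's unpacking code), where
widening the accumulator to 64 bits is the whole fix:

#include <stdint.h>

/*
 * Sketch of decoding the variable-length size in a pack object header.
 * First byte: high bit = "more follows", bits 6-4 = object type,
 * bits 3-0 = low size bits; each later byte contributes 7 more bits.
 * A 64-bit accumulator is enough for objects well past 4GiB.
 */
uint64_t decode_pack_size(const unsigned char *p, unsigned *type)
{
	unsigned char c = *p++;
	uint64_t size = c & 0x0f;
	int shift = 4;

	*type = (c >> 4) & 7;
	while (c & 0x80) {
		c = *p++;
		size |= (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	return size;
}

The delta-base-offset encoding is a slightly different variable-length
scheme, but has the same property: nothing in the on-disk pack data format
itself is fixed at 32 bits.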
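
The object count that "git index-pack" would have to rewrite lives in the
fixed 12-byte header at the front of every pack, and the trailing SHA-1
covers it, so that has to be recomputed too. Roughly (field names
approximate git's own definition):

#include <stdint.h>

/*
 * The fixed header at the start of a pack file.  Splitting one large
 * incoming pack into several smaller ones means emitting a fresh header
 * (with the right hdr_entries) for each piece and recomputing the
 * SHA-1 trailer over each new file.
 */
struct pack_header {
	uint32_t hdr_signature; /* "PACK" */
	uint32_t hdr_version;   /* network byte order */
	uint32_t hdr_entries;   /* number of objects in this pack */
};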
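
And for the "--limit 2g" syntax itself, a purely hypothetical sketch of
turning the size argument into a byte count (parse_limit is a made-up
name, not an existing git function; git-pack-objects would presumably
grow something along these lines):

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical parser for the proposed --limit argument ("2g", "512m",
 * "100k", or a plain byte count).  Returns 0 for anything malformed.
 */
uint64_t parse_limit(const char *arg)
{
	char *end;
	uint64_t v = strtoull(arg, &end, 10);

	switch (*end) {
	case 'g': case 'G': v <<= 30; end++; break;
	case 'm': case 'M': v <<= 20; end++; break;
	case 'k': case 'K': v <<= 10; end++; break;
	}
	return *end ? 0 : v;    /* reject trailing junk like "2gb" */
}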