Dana How <danahow@xxxxxxxxx> writes:

> Using fast-import and repack with the max-pack-size patch,
> 3628 commits were imported from Perforce comprising
> 100.35GB (uncompressed) in 38829 blobs, and saved in
> 7 packfiles of 12.5GB total (--window=0 and --depth=0 were
> used due to runtime limits). When using these packfiles,
> several git commands showed very large process sizes,
> and some slowdowns (compared to comparable operations
> on the linux kernel repo) were also apparent.
>
> git stores data in loose blobs or in packfiles. The former
> has essentially now become an exception mechanism, to store
> exceptionally *young* blobs. Why not use this to store
> exceptionally *large* blobs as well? This allows us to
> re-use all the "exception" machinery with only a small change.

Well, I had the impression that mmapping a single loose object (and
then munmapping it when done) would be more expensive than mmapping
a whole pack and accessing that object through a window, as long as
you touch the same set of objects and the object in the pack is not
deltified.

> Repacking the entire repository with a max-blob-size of 256KB
> resulted in a single 13.1MB packfile, as well as 2853 loose
> objects totaling 15.4GB compressed and 100.08GB uncompressed,
> 11 files per objects/xx directory on average. All was created
> in half the runtime of the previous yet with standard
> --window=10 and --depth=50 parameters. The data in the
> packfile was 270MB uncompressed in 35976 blobs. Operations
> such as "git-log --pretty=oneline" were about 30X faster
> on a cold cache and 2 to 3X faster otherwise. Process sizes
> remained reasonable.

I think a more reasonable comparison, to figure out what is really
going on, would be to create such a pack with the same 0/0 window
and depth settings (i.e. "keeping the huge objects out of the pack"
would be the only difference from the "horrible" case).

With huge packs, I wouldn't be surprised if seeking to a far-away
part of a packfile to extract a base object takes a lot longer than
reading a delta and applying it to a base object that is kept in
the in-core delta base cache.

Also, if by "process size" you mean the total VM size rather than
the RSS, I think that is the wrong measure. As long as you do not
touch the rest of the pack, even if you mmap a huge packfile you
would not bring that much data into main memory, would you? Well,
assuming that your mmap() implementation and virtual memory
subsystem do a decent job... maybe we are spoiled by Linux here...
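
To make the first point concrete, here is a minimal C sketch of the
two access patterns (this is not git's actual code; read_loose() and
the pack variables are made up for illustration):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Loose case: every access pays open + mmap + munmap + close. */
	static int read_loose(const char *path)
	{
		struct stat st;
		void *map;
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return -1;
		if (fstat(fd, &st) < 0) {
			close(fd);
			return -1;
		}
		map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		close(fd);
		if (map == MAP_FAILED)
			return -1;
		/* ... inflate and use the object ... */
		munmap(map, st.st_size);
		return 0;
	}

	/* Pack case: the file is mapped once; later accesses only
	 * compute a pointer into the existing mapping (git keeps
	 * several sliding "windows" per pack, but a single mapping
	 * shows the idea).
	 */
	static void *pack_map;	/* established once per packfile */
	static size_t pack_len;

	static const unsigned char *use_packed(size_t offset)
	{
		if (offset >= pack_len)
			return NULL;
		return (const unsigned char *)pack_map + offset;
	}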
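
The in-core delta base cache works roughly like this; a toy
direct-mapped sketch, keyed by pack offset (again, not the real
implementation):

	#include <stdlib.h>
	#include <sys/types.h>

	#define CACHE_SLOTS 256

	struct cache_entry {
		off_t offset;		/* pack offset of the cached base */
		void *data;		/* inflated object data */
		unsigned long size;
	};

	static struct cache_entry cache[CACHE_SLOTS];

	static struct cache_entry *slot_for(off_t offset)
	{
		return &cache[(offset >> 4) % CACHE_SLOTS];	/* crude hash */
	}

	static void *lookup_base(off_t offset, unsigned long *size)
	{
		struct cache_entry *e = slot_for(offset);
		if (e->data && e->offset == offset) {
			*size = e->size;	/* hit: no seek, no re-inflate */
			return e->data;
		}
		return NULL;			/* miss: caller must unpack */
	}

	static void store_base(off_t offset, void *data, unsigned long size)
	{
		struct cache_entry *e = slot_for(offset);
		free(e->data);			/* evict whatever was there */
		e->offset = offset;
		e->data = data;
		e->size = size;
	}

A hit here costs a table lookup; a miss against a 12.5GB pack can
cost a seek to a far-away base plus re-inflating it.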
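
And you can see the VM-size-vs-RSS distinction for yourself with
something like this Linux-only sketch ("huge.pack" is just a
placeholder filename): VmSize grows by the whole mapping, while
VmRSS grows by roughly one page.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static void show_vm(const char *when)
	{
		char line[256];
		FILE *f = fopen("/proc/self/status", "r");

		if (!f)
			return;
		printf("--- %s ---\n", when);
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "VmSize:", 7) ||
			    !strncmp(line, "VmRSS:", 6))
				fputs(line, stdout);
		fclose(f);
	}

	int main(void)
	{
		struct stat st;
		char *map;
		int fd = open("huge.pack", O_RDONLY); /* any multi-GB file */

		if (fd < 0 || fstat(fd, &st) < 0)
			return 1;
		show_vm("before mmap");
		map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
		if (map == MAP_FAILED)
			return 1;
		(void)*(volatile char *)map;	/* fault in a single page */
		show_vm("after mmap, one page touched");
		return 0;
	}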