On 5/21/07, Junio C Hamano <junkio@xxxxxxx> wrote:
Dana How <danahow@xxxxxxxxx> writes:

> git stores data in loose blobs or in packfiles. The former
> has essentially now become an exception mechanism, to store
> exceptionally *young* blobs. Why not use this to store
> exceptionally *large* blobs as well? This allows us to
> re-use all the "exception" machinery with only a small change.

Well, I had an impression that mmapping a single loose object (and then munmapping it when done) would be more expensive than mmapping a whole pack and accessing that object through a window, as long as you touch the same set of objects and the object in the pack is not deltified.
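For concreteness, here is roughly what "accessing a loose object" costs: a fresh open, a full zlib inflate, and a close, per object. This is only a minimal sketch assuming the standard loose-object on-disk format (zlib-deflated "<type> <size>\0<body>"); the function name is mine, and real git mmaps the file and inflates incrementally rather than slurping it:

```python
import os, tempfile, zlib

def read_loose_object(path):
    """Inflate one loose object file and split off its "<type> <size>" header."""
    with open(path, "rb") as f:           # one open/read/close per object
        data = zlib.decompress(f.read())  # full inflate, even for a megablob
    header, _, body = data.partition(b"\0")
    obj_type, size = header.split(b" ")
    assert int(size) == len(body), "corrupt loose object"
    return obj_type.decode(), body

# Demo: fabricate a tiny loose object on disk and read it back.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(zlib.compress(b"blob 5\x00hello"))
tmp.close()
obj_type, body = read_loose_object(tmp.name)
os.unlink(tmp.name)
```

For a 100MB+ blob the inflate dominates anyway, which is why the extra syscalls around it are in the noise.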
I agree with your comparison. However, if I'm processing a 100MB+ blob, I doubt the extra open/mmap/munmap/close calls are going to matter to me. What I think _helped_ me was that, with the megablobs pushed out of the pack, git-log etc could play around inside a "tiny" 13MB packfile very quickly. This packfile contained all the commits, all the trees, and all the blobs < 256KB.
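The partition described above (commits, trees, and small blobs in the pack; megablobs loose) is easy to state precisely. A toy sketch of the proposed max-blob-size split; the tuple layout and names here are invented for illustration, not git's actual data structures:

```python
MAX_BLOB_SIZE = 256 * 1024  # the 256KB threshold from the experiment

def partition(objects):
    """objects: iterable of (sha1, obj_type, size); returns (packed, loose)."""
    packed, loose = [], []
    for sha1, obj_type, size in objects:
        # Only blobs are ever kept loose; commits and trees always pack.
        dest = loose if obj_type == "blob" and size >= MAX_BLOB_SIZE else packed
        dest.append(sha1)
    return packed, loose

packed, loose = partition([
    ("c1", "commit", 300),
    ("t1", "tree",   700),
    ("b1", "blob",   1024),        # small blob: packed
    ("b2", "blob",   300 * 1024),  # megablob: kept loose
])
```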
> Repacking the entire repository with a max-blob-size of 256KB
> resulted in a single 13.1MB packfile, as well as 2853 loose
> objects totaling 15.4GB compressed and 100.08GB uncompressed,
> 11 files per objects/xx directory on average. All was created
> in half the runtime of the previous run, yet with the standard
> --window=10 and --depth=50 parameters. The data in the
> packfile was 270MB uncompressed in 35976 blobs. Operations
> such as "git-log --pretty=oneline" were about 30X faster
> on a cold cache and 2 to 3X faster otherwise. Process sizes
> remained reasonable.

I think a more reasonable comparison, to figure out what is really going on, would be to create such a pack with the same 0/0 window and depth (i.e. "keeping the huge objects out of the pack" would be the only difference from the "horrible" case). With huge packs, I wouldn't be surprised if seeking to extract a base object from a far-away part of a packfile takes a lot longer than reading a delta and applying it to a base object kept in the in-core delta base cache.
Yes, changing only one variable at a time would be better, and I will do that experiment. However, the huge pack _did_ have 0/0, and the small pack had default/default, which I think is the reverse of what you concluded above? So the experiment should make things no better for the huge-pack case.
Also, if by "process size" you mean the total VM size, not RSS, I think it is the wrong measure. As long as you do not touch the rest of the pack, even if you mmap a huge packfile, you would not bring that much data into main memory, would you? Well, assuming that your mmap() implementation and virtual memory subsystem do a decent job... maybe we are spoiled by Linux here...
You are right that the VM number was more shocking, but both were too high. Still, let's compare using 12GB+ of packfiles versus 13MB. In the former case, I'm depending on the sliding mmap windows doing the right thing in an operating regime no one uses (which is why Shawn was asking about my packedGitLimit settings etc.), and in the latter case the packfile is <10% of the linux-2.6 packfile, but I have to endure an extra open/mmap/munmap/close sequence when accessing enormous files. The small, known cost of the latter is more attractive to me than an unknown amount of tuning to get the former right, and in the former case I still have to figure out how to *create* the packfiles efficiently.

There's actually an even more extreme example from my day job. The software team has a project whose files/revisions would be similar to those in the linux kernel (larger commits, I'm sure). But they have *ONE* 500MB file they check in, because it takes 2 or 3 days to generate and different people use different versions of it. I'm sure it has 50+ revisions now. If they converted to git and included these blobs in their packfile, that's a 25GB uncompressed increase! *Every* git operation would have to wade through 10X -- 100X more packfile. Or the file could be kept as 50+ loose objects in objects/xx, requiring a few extra syscalls by each user to get a new version.

Thanks,
-- 
Dana L. How danahow@xxxxxxxxx +1 650 804 5991 cell