On 5/24/07, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
> Junio C Hamano <junkio@xxxxxxx> wrote:
> > "Dana How" <danahow@xxxxxxxxx> writes:
> > > We have three options in this case:
> > > (1) Drop the object (do not put it in the new pack(s)).
> > > (2) Pass the object into the new pack(s).
> > > (3) Write out the object as a new loose object.
> > > Option (1) is unacceptable.  When you call git-repack -a,
> > > it blindly deletes all the non-kept packs at the end.  So
> > > the megablobs would be lost.
> >
> > Ok, I can buy that -- neither (1) nor (2) is acceptable, and (3) is
> > the only sane thing to do for previously packed objects that
> > exceed the size limit.
>
> I still don't buy the idea that these megablobs shouldn't be packed.
> I understand Dana's pain here (at least a little bit, my problems
> aren't as bad as his are), but I also hate to see us run away from
> packfiles for these really sick cases just because we have some
> issues in our current packfile handling.
>
> Packfiles give us a lot of benefits:
>
> 1) less inode usage;
I agree with Geert that blowing an inode on a 100MB+ object is no big deal.
> 2) transport can write directly to local disk;
> 3) transport can (quickly) copy from local disk;
For (2) and (3), see my comments on the next paragraph plus the NFS discussion below.
> 4) testing for existence is *much* faster;
This is true. But I don't care about this cost if it is only incurred on large objects which are leaf nodes in the git "data relationship tree" (tags->commits->trees->blobs) anyway.
> 5) deltification is possible;
Again, Geert made a good argument that hadn't occurred to me: you definitely DON'T want to attempt deltification on such large objects. Junio recently added the delta/nodelta attribute, which would be useful to me, but unfortunately I have several continua of files, each sharing the same suffix but with widely varying sizes. So attributes won't help me unless the name globs in .gitattributes are expanded to full expressions similar to find(1) [i.e. including tests based on size, perms, type], which I think would be insane.
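For completeness, the attribute route would be something like the following (the suffixes are invented for illustration; the point is that matching is by path pattern only, with no way to test on size):

  # Illustrative only -- mark some suffixes as "never deltify" via the
  # delta attribute Junio added.  There is no size-based test here,
  # which is exactly the limitation described above.
  echo '*.iso   -delta' >>.gitattributes
  echo '*.vmdk  -delta' >>.gitattributes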
> Now #3 is actually really important here.  Don't forget that we *just*
> disabled the fancy "new loose object format".  It doesn't exist.  We
> can read the packfile-like loose objects, but we cannot write them
> anymore.
>
> So let's say we explode a megablob into a loose object, and it's
> 800 MiB by itself.  Now we have to send that object to a client.  Yes,
> that's right, we must *RECOMPRESS* 800 MiB for no reason.  Not the
> best choice.  Maybe we shouldn't have deleted that packfile-formatted
> loose object writer...
I completely agree with your argument, and I do not suggest that repositories which communicate with others via packs should use this feature. Our repositories will communicate via alternates/NFS in one direction and probably via packs in the other; in the latter case the packs would be generated with maxblobsize=0. See my comments on the next paragraph.
> Now one argument to work around that recompression problem would be to
> NFS-share out the loose objects directory, and let clients mount that
> volume and add it to their .git/objects/info/alternates list.  But
> this doesn't work in the very general distributed case, such as me
> getting huge files from kernel.org.  Last I checked, the kernel.org
> admins did not offer up NFS mounts.  Besides, the round-trip latency
> between me and kernel.org is too large for it to be useful anyway over
> NFS.  :)
The NFS case is exactly what I want to use. I want each repo to have its own packfiles to reduce load on the central alternate, but these local repos would not include megablobs. I do not have as strong a feeling about whether the central alternate should pack its megablobs or not [though I don't want to do it if it costs me deltification for everybody], but I need a way to exclude megablobs from getting into local packs. With such exclusion, git-gc/repack is extremely quick. There is NO WAY that can remain true if several GB have to be copied around, a point which again comes from Geert.

I think this conversation does suggest one alteration to my patch: as submitted, it writes out a loose object if the object is packed and has no loose object. It should really do this only if the object is packed LOCALLY and has no loose object.
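To make the NFS/alternates picture concrete, here is roughly the per-user setup I have in mind (the central path is invented for this example):

  # Borrow objects from the central repository over NFS instead of
  # copying them into this repository:
  echo /net/central/project.git/objects >>.git/objects/info/alternates

  # -l makes pack-objects ignore objects borrowed from the alternate,
  # so they are never copied into the local packs:
  git repack -a -d -l

That handles objects which already live on the central alternate; locally created megablobs are exactly the part that still needs something like repack.maxblobsize.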
> So I think this "explode out megablobs" is a bad idea.  It's violating
> other things that make us fast, like #3's being able to reuse large
> parts of an existing packfile during transfer.
As I said above, your argument is true, but it doesn't apply in my case.
> Dana pointed out the megablobs make access slower because their
> packfile indexes must still be searched to locate a commit; but if the
> megablob packfile(s) contain only blobs then there is no value in
> looking at those packfiles.
>
> We might be able to fix this by altering the sort_pack function in
> sha1_file.c to not only order by mtime, but also by the ratio of the
> size of the .pack to the number of objects stored in it.  Any packfile
> with a low size/object ratio is likely to be what Dana has been
> calling a "metadata" pack, holding things like tags, commits, trees
> and small blobs.  It's these packfiles that we want to search first,
> as they are the most likely to be accessed.  By pushing the megablob
> packs to the end of our packed_git search list we won't tend to scan
> their indexes, as most of our objects will be found earlier in the
> search list.  Hence we will generally avoid any costs associated with
> their indexes.
Good argument, and I submitted a patch to do this. Let's see who chokes on the floating-point arithmetic ;-)
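The patch does this inside sort_pack(), but the heuristic itself can be eyeballed from the shell; a rough sketch (not the patch):

  # Print bytes-per-object for each pack; packs with a huge ratio are
  # almost certainly megablob packs and could safely be searched last.
  for idx in .git/objects/pack/pack-*.idx; do
      pack=${idx%.idx}.pack
      objects=$(git show-index <"$idx" | wc -l)
      echo "$(( $(wc -c <"$pack") / objects )) bytes/object  $pack"
  done | sort -n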
> Huge packfiles probably should be scheduled for keeping with a .keep
> automatically.  We probably should teach pack-objects to generate a
> .keep file if the resulting .pack was over a certain size threshold
> (say 1.5 GiB by default) and teach git-repack to rename the .keep file
> as it also renames the .idx and .pack.
I have experimented with this, and Jakub Narebski made related suggestions. I find it quite hokey, but even if I do it in my central alternate, I still do not want to be packing megablobs in individual users' repos EVER, and I need some way to exclude them.
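For reference, the hand-rolled version of the .keep idea is roughly the following (the 1.5 GiB threshold is just the number you suggest above):

  # Mark any existing pack over ~1.5 GiB as kept; git-repack -a only
  # deletes the non-kept packs, so these survive future repacks.
  limit=$((1536 * 1024 * 1024))
  for p in .git/objects/pack/pack-*.pack; do
      if [ $(wc -c <"$p") -gt "$limit" ]; then
          : >"${p%.pack}.keep"    # an empty .keep file is enough
      fi
  done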
> Better that we degrade gracefully when faced with massive inputs than
> we do something stupid by default and make the poor user pay for their
> mistake of not thoroughly reading plumbing documentation before use.
Unnecessary copying of several GB is not degrading gracefully in my view. In fact, having repack.maxblobsize = 2000 (K) in the default "config" strikes me as degrading much more gracefully than what the code currently does. This silly patch took my packfile sets from 12GB+ down to 13MB, and it's difficult to describe how relieved I now feel. It's also difficult for me to believe that a setup which treats all 12GB (almost) equally is going to be as efficient as one which concentrates on the 13MB. That's three orders of magnitude, a point I've made before.

But I think you have an understandable motivation: you want packfiles to be as good as possible, and any escape mechanism from them decreases the motivation to "fix" packing. I agree with this, which is why I just submitted some other patches, but I don't share your goal of universality for all packfiles -- just the ones used for transport. Don't your packv4 plans introduce modifications which won't be used for transport as well?
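Concretely, the graceful default I am arguing for amounts to nothing more than this (repack.maxblobsize comes from my patch and is not in mainline git; the value is in KB):

  # Keep blobs larger than ~2 MB out of any pack this repository writes:
  git config repack.maxblobsize 2000

  # Subsequent repacks then only shuffle the small objects around:
  git repack -a -d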
> Now I would agree that we should punt on deltification of anything
> that is just too large, and let the user decide what too large means,
> and default it around 500 or 1024 MiB.  But I would still stuff it
> into a packfile.  Maybe it still makes sense to have a limit on the
> maximum size of a loose object to pack, but I think that's only to
> avoid the sick case of a very simple no-argument "git repack" taking a
> long while because there are 8 loose objects and 6 of them are 900 MiB
> image files.
Perhaps we _will_ make progress if we all agree to describe my situation as "sick" ;-). In this paragraph you seem to agree that there is some argument for keeping megablobs from _entering_ packs? One reason I like my patch is that I do view megablobs as perverting the system, and just keeping them out of the optimized packfile system is a big step forward.
> Once in a packfile, I'd keep it there, even if the user decreases the
> threshold, as the advantages of it being in the packfile outweigh the
> disadvantages of it being in the packfile.  And there's like no
> advantage to being loose once packed.
To (almost) follow this suggestion I would need git-fast-import to respect repack.maxblobsize as well. Is that OK with you?

I previously offered to Junio that the "write loose object" behavior could be restricted: it would only happen if -f were supplied to git-repack; otherwise the bad blob would pass through into the new pack. Does this reduction in strength make the feature more palatable to you? If the stats on a repo change significantly, "write loose object" becomes more important, e.g. if you have to make a significant reduction to repack.maxblobsize (or specify it for the first time).

I don't agree that once in a packfile, a blob should stay there. Its presence degrades access to the "normal" blobs cohabiting with it, so you will want to repack to separate them into different packs (the various .keep-related ideas) or just write them out loose.

To conclude: the patch wrote out a new loose object when the object was previously packed and is larger than repack.maxblobsize. I could change this to only happen when the object is (1) packed AND (2) locally packed AND (3) -f/--no-object-reuse was specified to git-repack/git-pack-objects. The previous behavior, that a megablob never _enters_ a pack, would remain unchanged. This more restrictive behavior would be sufficient for me, and I think I *need* it at least in the users' repositories in an NFS/alternates setup. What do you think?

Thanks,
-- 
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell