On Wed, Aug 24, 2011 at 10:17:49AM -0700, Junio C Hamano wrote: > I suspect that people with large binary assets were scared away by rumors > they heard second-hand, based on bad experiences other people had before > any of the recent efforts made in various "large Git" topics, and they > themselves haven't tried recent versions of Git enough to be able to tell > what the remaining pain points are. I wouldn't be surprised if none of the > core Git people tried shoving huge binary assets in test repositories with > recent versions of Git---I certainly haven't. I haven't tried anything really big in a while. My personal interest in big file support has been: 1. Mid-sized photos and videos (objects top out around 50M, total repo size is 4G packed). Most commits are additions or tweaks of exif tags (so they delta well). Using gitattributes (and especially textconv caching), it's really quite pleasant to use. Doing a full repack is my only complaint; the delta-compression isn't bad, but just the I/O on rewriting the whole thing is a killer. 2. Storing an entire audio collection in flac. Median file size is only around 20M, but the whole repo is 120G. Obviously compression doesn't buy much, so a git repo plus checkout is 240G, which is pretty hefty for most laptops. I played with this early on, but gave up; the data storage model just doesn't make sense. The two common use cases that aren't represented here are: 3. Big files, not just big repos. I.e., files that are 1G or more. 4. Medium-big files that don't delta well (e.g., metadata tweaks do delta well; rewriting media assets for a game don't delta well). I think recent changes (like putting big files straight to packs) make (3) and (4) reasonably pleasant. I'm not sure of the right answer for (1). The repack is the only annoying thing. But not repacking is not satisfying, either. You don't get deltas where they are applicable, and the server is always re-examining the pack for possible deltas on fetch and push. Some sort of hybrid loose-pack storage would be nice: store delta chains for big files in their own individual packs, but otherwise keep everything in a separate pack. We would want some kind of meta-index over all of these little pack-files, not just individual pack-file indices. But (2) is the hardest one. It would be nice if we had some kind of local-remote hybrid storage, where objects were fetched on demand from somewhere else. For example, developers on workstations with a fast local network to a storage server wouldn't have to replicate all of the objects locally. And for a true distributed setup, when the fast network isn't there, it would be nice to fail gracefully (which maybe just means saying "sorry, we can't do 'log -p' right now; try 'log --raw'"). I wonder how close one can get on (2) using alternates and a network-mounted filesystem. > People toyed around with ideas to have a separate object store > representation for large and possibly incompressible blobs (a possible > complaint being that it is pointless to send them even to its own > packfile). One possible implementation would be to add a new huge > hierarchy under $GIT_DIR/objects/, compute the object name exactly the > same way for huge blobs as we normally would (i.e. hash concatenation of > object header and then contents) to decide which subdirectory under the > "huge" hierarchy to store the data (huge/[0-9a-f]{2}/[0-9a-f]{38}/ like we > do for loose objects, or perhaps huge/[0-9a-f]{40}/ expecting that there > won't be very many). The data can be stored unmodified as a file in that > directory, with type stored in a separate file---that way, we won't have > to compress, but we just copy. You still need to hash it at least once to > come up with the object name, but that is what gives us integrity checks, > is unavoidable and is not going to change. Yeah. I think one of the bonuses there is that some filesystems are capable of referencing the same inodes in a copy-on-write way, so "add" and "checkout" cease to be a copy operation, but rather an inode-linking operation. Which is a big win, both for speed and storage. I've had dreams of using hard-linking to do something similar, but it's just not safe enough without some filesystem-level copy-on-write protection. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html