Jeff King <peff@xxxxxxxx> writes:

> I don't remember all of the details of bup, but if it's possible to
> implement something similar at a lower level (i.e., at the layer of
> packfiles or object storage), then it can be a purely local thing, and
> the compatibility issues can go away.

I tend to agree, and we might be closer than we realize.

I suspect that people with large binary assets were scared away by
rumors they heard second-hand, based on bad experiences other people
had before any of the recent work on the various "large Git" topics,
and they themselves haven't tried recent versions of Git enough to
tell what the remaining pain points are.  I wouldn't be surprised if
none of the core Git people have tried shoving huge binary assets
into test repositories with recent versions of Git---I certainly
haven't.

We used to always map the blob data as a whole for anything we do,
but these days, with changes like your abb371a (diff: don't retrieve
binary blobs for diffstat, 2011-02-19) and my recent "send large blob
straight to a new pack" and "stream large data out to the working
tree without holding everything in core while checking out" topics, I
suspect that support for local use of large blobs is already quite a
bit better than it was in the old days.  Git might even be usable
locally without anything else---I find that implausible, but I
wouldn't be surprised if only a handful of minor things remained for
us to add before it is.

People have toyed with ideas for a separate object store
representation for large and possibly incompressible blobs (one
possible complaint being that it is pointless to deflate them even
into their own packfile).  One possible implementation would be to
add a new "huge" hierarchy under $GIT_DIR/objects/ and compute the
object name for huge blobs exactly the same way we normally would
(i.e. hash the concatenation of the object header and then the
contents), using it to decide which subdirectory under the "huge"
hierarchy stores the data (huge/[0-9a-f]{2}/[0-9a-f]{38}/ like we do
for loose objects, or perhaps huge/[0-9a-f]{40}/, expecting that
there won't be very many).  The data can be stored unmodified as a
file in that directory, with the type stored in a separate file---
that way, we never have to compress, we just copy.  You still need to
hash the data at least once to come up with the object name, but that
is what gives us integrity checks; it is unavoidable and is not going
to change.

The sha1_object_info() layer can learn to return the type and size
from such a representation, and you can further tweak the same places
the "streaming checkout" and the "checkin to a pack" topics touched
to support such a representation.

I would suspect that the local object representation is _not_ the
largest pain point; such a separate object store representation does
not buy us very much over a simpler "single large blob in a separate
packfile", and if the counter-argument is "no, decompressing still
costs a lot", then the real issue might be that we decompress and
look at the data when we do not have to (i.e. issues similar to what
abb371a addressed), not that "decompress vs. straight copy" makes a
big difference.  I would further suspect that we _might_ need better
support for local repacking and object transfer, with or without such
a third object representation.
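
To make the objects/huge/ idea above a bit more concrete, here is a
minimal sketch, in Python rather than Git's C purely for illustration;
the helper name store_huge_blob and the "data"/"type" file names
inside the object directory are this sketch's own assumptions, not
anything Git implements:

import hashlib, os, shutil

def store_huge_blob(path, git_dir=".git"):
    # The object name is computed exactly as for a normal blob:
    # SHA-1 over "blob <size>\0" followed by the raw contents.
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(b"blob %d\0" % size)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    sha1 = h.hexdigest()

    # Hypothetical huge/[0-9a-f]{2}/[0-9a-f]{38}/ layout: the data is
    # copied unmodified, with the type kept in a small side file.
    objdir = os.path.join(git_dir, "objects", "huge", sha1[:2], sha1[2:])
    os.makedirs(objdir, exist_ok=True)
    shutil.copyfile(path, os.path.join(objdir, "data"))
    with open(os.path.join(objdir, "type"), "w") as f:
        f.write("blob\n")
    return sha1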
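
The lookup side that sha1_object_info() would learn, answering "what
type and how big?" without ever touching the data, could then be as
simple as this (again only a sketch under the same assumed layout):

import os

def huge_object_info(sha1, git_dir=".git"):
    # Read the type from the side file and the size via stat(),
    # without opening (let alone decompressing) the data itself.
    objdir = os.path.join(git_dir, "objects", "huge", sha1[:2], sha1[2:])
    with open(os.path.join(objdir, "type")) as f:
        obj_type = f.read().strip()
    size = os.path.getsize(os.path.join(objdir, "data"))
    return obj_type, size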