On Wed, Aug 24, 2011 at 17:17, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Jeff King <peff@xxxxxxxx> writes:
>
>> I don't remember all of the details of bup, but if it's possible to
>> implement something similar at a lower level (i.e., at the layer of
>> packfiles or object storage), then it can be a purely local thing, and
>> the compatibility issues can go away.
>
> I tend to agree, and we might be closer than we realize.
>
> I suspect that people with large binary assets were scared away by
> rumors they heard second-hand, based on bad experiences other people had
> before any of the recent efforts made in various "large Git" topics, and
> they themselves haven't tried recent versions of Git enough to be able
> to tell what the remaining pain points are. I wouldn't be surprised if
> none of the core Git people have tried shoving huge binary assets into
> test repositories with recent versions of Git---I certainly haven't.
>
> We used to always map the blob data as a whole for anything we did, but
> these days, with changes like your abb371a (diff: don't retrieve binary
> blobs for diffstat, 2011-02-19) and my recent "send large blob straight
> to a new pack" and "stream large data out to the working tree without
> holding everything in core while checking out" topics, I suspect that
> the support for local usage of large blobs might be sufficiently better
> than in the old days. Git might even already be usable locally without
> anything else (which I find implausible), but I wouldn't be surprised if
> only a handful of minor things remained that we need to add to make it
> usable.
>
> People have toyed around with ideas for a separate object store
> representation for large and possibly incompressible blobs (a possible
> complaint being that it is pointless to send them even to their own
> packfile). One possible implementation would be to add a new "huge"
> hierarchy under $GIT_DIR/objects/ and compute the object name for huge
> blobs exactly the same way we normally would (i.e. hash the
> concatenation of the object header and then the contents) to decide
> which subdirectory under the "huge" hierarchy to store the data in
> (huge/[0-9a-f]{2}/[0-9a-f]{38}/ like we do for loose objects, or perhaps
> huge/[0-9a-f]{40}/ expecting that there won't be very many). The data
> can be stored unmodified as a file in that directory, with the type
> stored in a separate file---that way we don't have to compress; we just
> copy. You still need to hash the data at least once to come up with the
> object name, but that is what gives us integrity checks; it is
> unavoidable and is not going to change.
>
> The sha1_object_info() layer can learn to return the type and size from
> such a representation, and you can further tweak the same places that
> the "streaming checkout" and "checkin to a pack" topics touched to
> support such a representation.
>
> I suspect that the local object representation is _not_ the largest pain
> point; such a separate object store representation does not buy us very
> much over a simpler "single large blob in a separate packfile", and if
> the counter-argument is "no, decompressing still costs a lot", then the
> real issue might be that we decompress and look at the data when we do
> not have to (i.e. issues similar to what abb371a addressed), not that
> "decompress vs. straight copy" makes a big difference.

I've added Avery to the Cc list because he really needs to chime in here.
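
For what it's worth, my reading of the "huge" hierarchy Junio describes
above boils down to something like the following rough Python sketch.
None of this is code Git actually ships; the data/type file layout and
the helper names are my own guesses for illustration:

    # Sketch of the proposed objects/huge/ store; the layout and names
    # are hypothetical, only the object-name computation matches Git's.
    import hashlib
    import os
    import shutil

    def huge_object_name(path, obj_type="blob"):
        """Hash the "<type> <size>" header, a NUL, then the contents,
        exactly as for a normal loose object."""
        size = os.path.getsize(path)
        h = hashlib.sha1()
        h.update(("%s %d" % (obj_type, size)).encode() + b"\x00")
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def store_huge_object(git_dir, path, obj_type="blob"):
        """Copy the payload unmodified under objects/huge/xx/yyyy...,
        with the type kept in a separate small file."""
        sha1 = huge_object_name(path, obj_type)
        obj_dir = os.path.join(git_dir, "objects", "huge", sha1[:2], sha1[2:])
        os.makedirs(obj_dir, exist_ok=True)
        shutil.copyfile(path, os.path.join(obj_dir, "data"))  # no deflate, just copy
        with open(os.path.join(obj_dir, "type"), "w") as f:
            f.write(obj_type + "\n")
        return sha1

Because the name is computed exactly as for a loose object, it should
match what `git hash-object` prints for the same file, so
sha1_object_info() could learn to peek into objects/huge/ without
changing how names are computed; only the on-disk payload stops being
deflated.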
I am completely unqualified to comment on this, but I think it would be
silly to ignore the insights Avery has about storing large objects;
`bup' uses rolling checksums, a bloom-filter implementation, and who
knows what else.
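
To give a flavour of what bup does differently, below is a toy
content-defined chunker built around an rsync-style rolling sum. It is
purely illustrative and is not bup's actual bupsplit or bloom-filter
code; the constants and function name are made up for the example:

    # Toy content-defined chunking; boundaries depend only on nearby
    # bytes, so a small edit only disturbs the chunks around it.
    WINDOW = 64        # bytes in the rolling window
    SPLIT_BITS = 13    # cut roughly every 8 KiB of input on average

    def split_chunks(data):
        """Return (offset, length) chunk boundaries for a bytes object."""
        chunks, start = [], 0
        s1 = s2 = 0                   # plain sum and weighted sum
        window = bytearray(WINDOW)    # window starts as virtual zeros
        mask = (1 << SPLIT_BITS) - 1
        for i, new in enumerate(data):
            old = window[i % WINDOW]
            window[i % WINDOW] = new
            s1 = (s1 + new - old) & 0xFFFF
            s2 = (s2 + s1 - WINDOW * old) & 0xFFFF
            if (s2 & mask) == mask:   # low bits all ones: cut here
                chunks.append((start, i + 1 - start))
                start = i + 1
        if start < len(data):
            chunks.append((start, len(data) - start))
        return chunks

The point is that identical regions of a huge file keep producing the
same chunks even after an insertion elsewhere, which is what lets bup
store new revisions of large binaries cheaply.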