On Wed, 27 May 2009, Jeff King wrote:

> On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote:
>
> > My idea for handling big files is simply to:
> >
> >  1) Define a new parameter to determine what is considered a big
> >     file.
> >
> >  2) Store any file larger than the threshold defined in (1) directly
> >     into a pack of their own at "git add" time.
> >
> >  3) Never attempt to diff nor delta large objects, again according
> >     to (1) above.  It is typical for large files not to be
> >     deltifiable, and a diff for files in the thousands of megabytes
> >     cannot possibly be sane.
>
> What about large files that have a short metadata section that may
> change? Versions with only the metadata changed delta well, and with a
> custom diff driver, can produce useful diffs. And I don't think that is
> an impractical or unlikely example; large files can often be tagged
> media.

Sure... but what is the actual data pattern currently used out there?
What does P4 or CVS or SVN do with multiple versions of almost identical
2GB+ files?

My point is: if the tools people are already using with gigantic
repositories don't bother with delta compression, then we don't lose
much by doing the same to make git usable with those repositories.  And
this can be achieved pretty easily with fairly minor changes.

Plus, my proposal doesn't introduce any incompatibility in the git
repository format, while not precluding possible future enhancements.
For example, it would be non-trivial but still doable to make git work
on data streams instead of buffers.  The current code for blob
read/write/delta could be kept for performance, along with a parallel
version doing the same but with file descriptors and pread/pwrite for
big files.

> Linus' "split into multiple objects" approach means you could perhaps
> split intelligently into metadata and "uninteresting data" sections
> based on the file type. That would make things like rename detection
> very fast. Of course it has the downside that you are cementing
> whatever split you made into history for all time. And it means that
> two people adding the same content might end up with different trees.
> Both things that git tries to avoid.

Exactly.  And honestly, I don't think it would be worth trying to do
inexact rename detection for huge files anyway.  It is rarely the case
that moving/renaming a movie file needs to change its content in some
way.

> I wonder if it would be useful to make such a split at _read_ time.
> That is, still refer to the sha-1 of the whole content in the tree
> objects, but have a separate cache that says "hash X splits to the
> concatenation of Y,Z". Thus you can always refer to the "pure" object,
> both as a user, and in the code. So we could avoid retrofitting all of
> the code -- just some parts like diff might want to handle an object
> in multiple segments.

Unless there are real-world scenarios where diffing (as we know it) two
huge files is a common and useful operation, I don't think we should
even try to consider that problem.  What people are doing with huge
files is storing them and retrieving them, so we should probably limit
ourselves to making those operations work for now.

And to that effect, I don't think it would be wise to introduce
artificial segmentations in the object structure that would make both
the code and the git model more complex.  We could just as well confine
the complexity to the code that deals with blobs, so they don't all have
to be loaded in memory at once, and keep the git repository model
simple.
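Just to make (1) and (3) above concrete: on the code side this would be
little more than a config variable plus an early bail-out in the
diff/delta paths.  A rough and untested sketch; the config name
"core.bigfilethreshold" and the helpers below don't exist anywhere,
they're only placeholders:

#include "cache.h"

/* Illustration only: none of these names exist in git today. */
static unsigned long big_file_threshold = 512 * 1024 * 1024; /* arbitrary */

static int git_bigfile_config(const char *var, const char *value, void *cb)
{
	if (!strcmp(var, "core.bigfilethreshold")) {
		big_file_threshold = git_config_ulong(var, value);
		return 0;
	}
	return git_default_config(var, value, cb);
}

/* The diff and delta code paths would then simply refuse big objects: */
static inline int is_big_file(unsigned long size)
{
	return size >= big_file_threshold;
}

The same threshold could later be reused to select between the regular
in-core blob functions and the chunked ones discussed below.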
So if we want to do the real thing and deal with huge blobs, there is
only a small set of operations that needs to be considered:

 - Creation of new blobs (or "git add") for huge files: can be done
   trivially in chunks.  The open issue is whether the SHA1 of the file
   is computed with a first pass over the file and, if the object
   doesn't already exist, a second pass performed to deflate it if
   desired; or whether the SHA1 summing and the deflating are done in
   the same pass and the result discarded if the object happens to
   already exist.  Either way it is trivial to implement.

 - Checkout of huge files: still trivial to perform in the non-delta
   case.  In the delta case, that _could_ be kept quite simple, by
   recursively parsing deltas, if the base objects were not deflated.
   But again, it remains to be seen whether 1) deflating or even
   2) deltifying huge files is useful in practice with real-world data.

 - repack/fetch/pull: in the pack data reuse case, the code is already
   fine as it streams small blocks from the source to the destination.
   Delta compression can be done by using coarse indexing of the source
   object and loading/discarding portions of the source data while the
   target object is processed in a streaming fashion.

Other than that, I don't see how git could be useful for huge files.

The above operations (read/write/delta of huge blobs) would need to be
done with a separate set of functions, and a configurable size threshold
would select either the regular or the chunked set.  Nothing
fundamentally difficult in my mind.


Nicolas
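P.S.  To show how trivial the first point really is, the first-pass SHA1
computation could be done in fixed-size chunks along these lines.
Untested sketch; hash_huge_file() is a made-up name, and the header
construction simply mirrors what sha1_file.c already does for in-core
blobs:

#include "cache.h"

/*
 * Untested: compute the object name of a huge file in 1MB chunks
 * instead of mapping the whole thing in memory.
 */
static int hash_huge_file(int fd, unsigned long size, unsigned char *sha1)
{
	static char buf[1024 * 1024];
	char hdr[32];
	int hdrlen;
	SHA_CTX c;

	/* same "blob <size>" + NUL header as the in-core code uses */
	hdrlen = sprintf(hdr, "blob %lu", size) + 1;

	SHA1_Init(&c);
	SHA1_Update(&c, hdr, hdrlen);
	while (size) {
		ssize_t n = xread(fd, buf,
				  size < sizeof(buf) ? size : sizeof(buf));
		if (n <= 0)
			return -1;
		SHA1_Update(&c, buf, n);
		size -= n;
	}
	SHA1_Final(sha1, &c);
	return 0;
}

If the object turns out to be new, a second pass would then deflate it
(or not) and append it to its own pack.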
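And the non-delta checkout case is just as mechanical: inflate from the
pack straight to the working tree file descriptor with a small fixed
buffer, never holding the whole blob in memory.  Again an untested
sketch with a made-up name and no real error handling:

#include "cache.h"

/*
 * Untested: stream a non-delta, deflated blob from a pack file to an
 * output file descriptor, 64KB at a time.
 */
static int stream_packed_blob(int out_fd, int pack_fd, off_t offset,
			      unsigned long expect)
{
	unsigned char in[65536], out[65536];
	z_stream stream;
	int status = Z_OK;

	memset(&stream, 0, sizeof(stream));
	inflateInit(&stream);

	do {
		ssize_t n = pread(pack_fd, in, sizeof(in), offset);
		if (n <= 0)
			break;
		offset += n;
		stream.next_in = in;
		stream.avail_in = n;
		do {
			stream.next_out = out;
			stream.avail_out = sizeof(out);
			status = inflate(&stream, Z_NO_FLUSH);
			write_in_full(out_fd, out,
				      sizeof(out) - stream.avail_out);
		} while (stream.avail_out == 0 && status == Z_OK);
	} while (status == Z_OK);

	inflateEnd(&stream);
	return (status == Z_STREAM_END && stream.total_out == expect) ? 0 : -1;
}

Whether the delta case deserves similar treatment depends entirely on
whether deltifying huge files turns out to be useful at all, as noted
above.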