On Wed, May 27, 2009 at 07:29:02PM -0400, Nicolas Pitre wrote:

> > What about large files that have a short metadata section that may
> > change? Versions with only the metadata changed delta well, and
> > with a custom diff driver, can produce useful diffs. And I don't
> > think that is an impractical or unlikely example; large files can
> > often be tagged media.
>
> Sure... but what is the actual data pattern currently used out there?

I'm not sure what you mean by "out there", but I just described
exactly the data pattern of a repo I have: a few thousand 5-megapixel
JPEGs and some short (a few dozen megabytes) AVIs, with frequent
additions, infrequently changing photo contents, and moderately
changing metadata.
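For reference, the sort of custom diff driver I mean is something
like the exif textconv example in gitattributes(5); this sketch
assumes you have exiftool installed:

  # mark JPEGs as using the "exif" diff driver
  $ echo '*.jpg diff=exif' >>.gitattributes

  # have the driver convert the blob to text via exiftool before diffing
  $ git config diff.exif.textconv exiftool

With that, "git diff" on a photo shows the changed tags instead of
the usual "binary files differ" message.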
I don't know how that matches other people's needs. Game designers
have been mentioned before in the context of large media checkins,
and I think they focus less on metadata changes. Media is either
there to stay, or it is replaced as a whole.

> What does P4 or CVS or SVN do with multiple versions of almost
> identical 2GB+ files?

I only ever tried this with CVS, which just stored the entire binary
version as a whole. And of course running "diff" was useless, but
then it was also useless on text files. ;) I suspect CVS would simply
choke on a 2G file.

But I don't want merely to do as well as those other tools. I want to
be able to do all of the useful things git can do, but with large
files.

> My point is, if the tool people are already using with gigantic
> repositories is not bothering with delta compression then we don't
> lose much in making git usable with those repositories by doing the
> same. And this can be achieved pretty easily with fairly minor
> changes. Plus, my proposal doesn't introduce any incompatibility in
> the git repository format while not denying possible future
> enhancements.

Right. I think in some ways we are perhaps talking about two
different problems. I am really interested in moderately large files
(a few megabytes up to a few dozen or even a hundred megabytes), but
I want git to be _fast_ at dealing with them, and at doing useful
operations on them (like rename detection, diffing, etc.).

A smart splitter would probably want to mark part of the split as
"this section is large and uninteresting for compression, deltas,
diffing, and renames". And that half may be stored in the way that
you are proposing (in a separate single-object pack, no compression,
no delta, etc.). So in a sense, I think what I am talking about would
be built on top of what you want to do.

> > very fast. Of course it has the downside that you are cementing
> > whatever split you made into history for all time. And it means
> > that two people adding the same content might end up with
> > different trees. Both things that git tries to avoid.
>
> Exact. And honestly I don't think it would be worth trying to do
> inexact rename detection for huge files anyway. It is rarely the
> case that moving/renaming a movie file needs to change its content
> in some way.

I should have been clearer in my other email: I think splitting that
is represented in the actual git trees is not going to be worth the
hassle. But I do think we can get some of the benefits by maintaining
a split cache for viewers.

And again, maybe my use case is crazy, but in my repo I have renames
and metadata content changes together.

> Unless there are real world scenarios where diffing (as we know it)
> two huge files is a common and useful operation, I don't think we
> should even try to consider that problem. What people are doing
> with huge files is storing them and retrieving them, so we probably
> should limit ourselves to making those operations work for now.

Again, this is motivated by a real use case that I have.

> And to that effect I don't think it would be wise to introduce
> artificial segmentations in the object structure that would make
> both the code and the git model more complex. We could just as well
> limit the complexity to the code for dealing with blobs without
> having to load them all in memory at once, and keep the git
> repository model simple.

I do agree with this; I don't want to make any changes to the
repository model.

> So if we want to do the real thing and deal with huge blobs, there
> is only a small set of operations that need to be considered:

I think everything you say here is sensible; I just want more
operations for my use case.

-Peff