Re: Problem with large files on different OSes

On Thu, 28 May 2009, Jeff King wrote:

> On Wed, May 27, 2009 at 07:29:02PM -0400, Nicolas Pitre wrote:
> 
> > > What about large files that have a short metadata section that may
> > > change? Versions with only the metadata changed delta well, and with a
> > > custom diff driver, can produce useful diffs. And I don't think that is
> > > an impractical or unlikely example; large files can often be tagged
> > > media.
> > 
> > Sure... but what is the actual data pattern currently used out there?
> 
> I'm not sure what you mean by "out there", but I just exactly described
> the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
> short (a few dozen megabytes) AVIs, frequent additions, infrequently
> changing photo contents, and moderately changing metadata). I don't know
> how that matches other peoples' needs.

How does diffing JPEGs or AVIs à la 'git diff' make any sense?
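
I suppose what you have in mind is the textconv facility, i.e. 
something along these lines (a sketch only; it assumes a recent git 
with textconv support and that exiftool is installed, and the "exif" 
driver name is arbitrary):

    $ cat .gitattributes
    *.jpg diff=exif
    $ git config diff.exif.textconv exiftool

That still only diffs the extracted metadata rendered as text, not 
the image data itself.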

Also, you surely have little to delta against, since you add new 
photos far more often than you modify existing ones?

> Game designers have been mentioned before in large media checkins, and I
> think they focus less on metadata changes. Media is either there to
> stay, or it is replaced as a whole.

Right.  And my proposal fits that scenario pretty well.

> > What does P4 or CVS or SVN do with multiple versions of almost identical 
> > 2GB+ files?
> 
> I only ever tried this with CVS, which just stored the entire binary
> version as a whole. And of course running "diff" was useless, but then
> it was also useless on text files. ;) I suspect CVS would simply choke
> on a 2G file.
> 
> But I don't want to do as well as those other tools. I want to be able
> to do all of the useful things git can do but with large files.

Right now git simply does much worse, so doing as well is still a 
worthy goal.

> Right. I think in some ways we are perhaps talking about two different
> problems. I am really interested in moderately large files (a few
> megabytes up to a few dozen or even a hundred megabytes), but I want
> git to be _fast_ at dealing with them, and doing useful operations on
> them (like rename detection, diffing, etc).

I still can't see how diffing big files is useful.  Certainly you'll 
need a specialized external diff tool, in which case it is no longer 
git's problem except for writing the content out to temporary files.
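
To be clear, that mechanism exists already.  A sketch, assuming a 
hypothetical avi-metadata-diff script somewhere in your PATH (the 
"avi" driver name is arbitrary):

    $ cat .gitattributes
    *.avi diff=avi
    $ git config diff.avi.command avi-metadata-diff

Git then writes both blob versions out to temporary files and invokes 
the command with their paths (plus mode and sha1 information); those 
temporary files are the only part git has to provide.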

Rename detection: either you pay for analyzing the big files each 
time, or you (re)create a cache with that information so no analysis 
is needed the second time around.  This is something that even small 
files might benefit from.  But in any case, there is no other way 
than to bite the bullet at least initially, and big files will be 
slower to process no matter what.
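
If memory serves, the only knobs available today either ask for that 
analysis per invocation or cap its cost, e.g.:

    $ git diff -M HEAD^ HEAD            # ask for rename detection
    $ git config diff.renameLimit 200   # cap the number of candidates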

> A smart splitter would probably want to mark part of the split as "this
> section is large and uninteresting for compression, deltas, diffing, and
> renames".  And that half may be stored in the way that you are proposing
> (in a separate single-object pack, no compression, no delta, etc). So in
> a sense I think what I am talking about would build on top of what you
> want to do.

It looks to me like you wish git would do what a specialized database 
is much better suited for.  Aren't there already tools to gather 
picture metadata, just like iTunes does with MP3s?

> But I do think we can get some of the benefits by maintaining a split
> cache for viewers.

Sure.

But being able to deal with large (1GB and more) files remains a totally 
different problem.
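
For completeness: the per-path attributes can already tell 
pack-objects and diff to leave such files alone (attribute names from 
memory):

    $ cat .gitattributes
    *.avi -delta -diff

That only avoids wasted work, though; it does nothing for the memory 
footprint of handling 1GB+ blobs in the first place.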


Nicolas
