Re: Problem with large files on different OSes

Jeff King <peff@xxxxxxxx> · Thu, 28 May 2009 17:21:06 -0400

On Thu, May 28, 2009 at 04:54:28PM -0400, Nicolas Pitre wrote:

> > I'm not sure what you mean by "out there", but I just exactly described
> > the data pattern of a repo I have (a few thousand 5 megapixel JPEGs and
> > short (a few dozens of megabytes) AVIs, frequent additions, infrequently
> > changing photo contents, and moderately changing metadata). I don't know
> > how that matches other peoples' needs.
> 
> How do diffing à la 'git diff' JPEGs or AVIs make sense?

It is useful to see the changes in a text representation of the
metadata, with a single-line mention if the image or movie data has
changed. It's why I wrote the textconv feature.
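
To illustrate, a textconv helper is a program that git runs with a file
name as its argument and whose stdout is diffed in place of the raw
content. The sketch below is hypothetical, not the helper actually used
on this repo: it prints only the size and a SHA-1 of the raw bytes, so a
change to the image data shows up as a single changed line. A real helper
would also print the tags; that part is left out because it depends on
the metadata tool. The function name `describe` is made up for this
example.

```python
import hashlib
import sys


def describe(path, chunk_size=1 << 16):
    """Return a short text summary of a binary file.

    Hypothetical output format: one line for the size and one line for
    a SHA-1 of the raw bytes, so a data change is one changed line.
    """
    digest = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so a large file is never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    # A real helper would print the picture's tags here as well.
    return "size: %d bytes\ndata-sha1: %s\n" % (size, digest.hexdigest())


if __name__ == "__main__" and len(sys.argv) == 2:
    # git invokes the textconv program with the file name as its argument.
    sys.stdout.write(describe(sys.argv[1]))
```

It would be hooked up with a "diff=<driver>" line in .gitattributes and a
matching diff.<driver>.textconv config entry naming the script.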

> Also, you certainly have little to delta against as you add new photos 
> more often than modifying existing ones?

I do add new photos more than modifying existing ones. But I do modify
the old ones (tag corrections, new tags I didn't think of initially,
updates to the tagging schema, etc), too.

The sum of the sizes for all objects in the repo is 8.3G. The fully
packed repo is 3.3G. So there clearly is some benefit from deltas, and I
don't want to just turn them off. But doing the actual repack is
painfully slow.

> I still can't see how diffing big files is useful.  Certainly you'll 
> need a specialized external diff tool, in which case it is not git's 
> problem anymore except for writing content to temporary files.

Writing content to temporary files is actually quite slow when the
files are hundreds of megabytes (even "git show" can be painful, let
alone "git log -p"). But that is something that can be dealt with by
improving the interface to external diff and textconv to avoid writing
out the whole file (and is something I have patches in the works for,
but they need to be finished and cleaned up).

> Rename detection: either you deal with the big files each time, or you 
> (re)create a cache with that information so no analysis is needed the 
> second time around.  This is something that even small files might 
> possibly benefit from.  But in any case, there is no other ways but to 
> bite the bullet at least initially, and big files will be slower to 
> process no matter what.

Right. What I am proposing is basically to create such a cache. But it
is one that is general enough that it could be used for more than just
the rename detection (though arguably rename detection and deltification
could actually share more of the same techniques, in which case a cache
for one would help the other).
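
A minimal sketch of the kind of cache meant here, under stated
assumptions: objects are immutable and named by their id, so any summary
derived from a blob can be stored under that id and never needs to be
invalidated. The fixed-size-chunk fingerprint, the Jaccard similarity
score, and the names `FingerprintCache`, `fingerprint` and `similarity`
are all made up for illustration; this is not how git's rename detection
or delta code works.

```python
import hashlib


def blob_id(data):
    """Git-style blob id: SHA-1 over a 'blob <size>\\0' header plus content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def fingerprint(data, chunk_size=64):
    """Hypothetical summary of a blob: short hashes of fixed-size chunks."""
    return frozenset(
        hashlib.sha1(data[i:i + chunk_size]).digest()[:8]
        for i in range(0, len(data), chunk_size)
    )


def similarity(a, b):
    """Jaccard similarity of two fingerprints, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class FingerprintCache:
    """Maps object id -> fingerprint; entries never go stale."""

    def __init__(self):
        self._by_id = {}
        self.misses = 0

    def get(self, oid, load):
        """Return the fingerprint for oid, calling load() only on a miss."""
        fp = self._by_id.get(oid)
        if fp is None:
            self.misses += 1          # the expensive path: read the blob
            fp = fingerprint(load())
            self._by_id[oid] = fp
        return fp
```

The big file is read once, on the first miss; later comparisons touch
only the small fingerprints.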

> Looks to me like you wish for git to do what a specialized database 
> would be much more suited for the task.  Isn't there tools to gather 
> picture metadata info, just like itunes does with MP3s already?

Yes, I already have tools for handling picture metadata info. How do I
version control that information? How do I keep it in sync across
multiple checkouts? How do I handle merging concurrent changes from
multiple sources? How do I keep that metadata connected to the pictures
that it describes? The things I want to do are conceptually
no different from what I do with other files; it's merely the size of the
files that makes working with them in git less convenient (but it does
_work_; I am using git for this _now_, and I have been for a few years).

> But being able to deal with large (1GB and more) files remains a totally 
> different problem.

Right, that is why I think I will end up building on top of what you do.
I am trying to make a way for some operations to avoid looking at the
entire file at all, even by streaming it, which should drastically speed
up those operations.  But it is unavoidable that some operations (e.g., "git
add") will have to look at the entire file. And that is what your
proposal is about; streaming is basically the only way forward there.
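
For the operations that must see every byte, streaming means something
like the sketch below: the blob id is computed chunk by chunk, so memory
use stays bounded no matter how large the file is. This shows only the
hashing step under the usual "blob <size>\0" header convention; it is an
illustration, not the proposed implementation, and the name
`stream_blob_id` and the 1 MiB chunk size are arbitrary choices.

```python
import hashlib
import os


def stream_blob_id(path, chunk_size=1 << 20):
    """Compute a git-style blob id without loading the file into memory."""
    digest = hashlib.sha1()
    # The header needs the size up front; take it from the file system.
    # Assumes the file does not change while it is being read.
    digest.update(b"blob %d\0" % os.path.getsize(path))
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)   # only one chunk is held at a time
    return digest.hexdigest()
```

The whole file is still read once, which is the cost that cannot be
avoided for something like "git add".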

-Peff