On Thu, May 28, 2009 at 04:54:28PM -0400, Nicolas Pitre wrote:

> > I'm not sure what you mean by "out there", but I just described exactly
> > the data pattern of a repo I have (a few thousand 5-megapixel JPEGs and
> > short AVIs of a few dozen megabytes each, with frequent additions,
> > infrequently changing photo contents, and moderately changing metadata).
> > I don't know how well that matches other people's needs.
>
> How does diffing JPEGs or AVIs à la 'git diff' make sense?

It is useful to see the changes in a text representation of the metadata,
with a single-line mention if the image or movie data has changed. That is
why I wrote the textconv feature.
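(As an aside, a minimal setup along those lines might look like the
snippet below; exiftool is just one possible converter, and the "jpeg"
driver name is arbitrary:

  # map JPEGs to a textconv driver (the driver name is arbitrary)
  $ echo '*.jpg diff=jpeg' >>.gitattributes
  # point the driver at any metadata dumper; exiftool is only an example
  $ git config diff.jpeg.textconv exiftool

With that in place, "git diff" on a JPEG shows a diff of the metadata
dump instead of the usual "Binary files differ" notice.)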
> Also, you certainly have little to delta against, as you add new photos
> more often than you modify existing ones?

I do add new photos more often than I modify existing ones. But I do
modify the old ones, too (tag corrections, new tags I didn't think of
initially, updates to the tagging schema, etc.).

The sum of the sizes of all objects in the repo is 8.3G. The fully packed
repo is 3.3G. So there clearly is some benefit from deltas, and I don't
want to just turn them off. Doing the actual repack is painfully slow,
though.

> I still can't see how diffing big files is useful. Certainly you'll
> need a specialized external diff tool, in which case it is not git's
> problem anymore except for writing content to temporary files.

Writing content to temporary files is actually quite slow when the files
are hundreds of megabytes (even "git show" can be painful, let alone
"git log -p"). But that is something that can be dealt with by improving
the interface to external diff and textconv to avoid writing out the
whole file (and I have patches in the works for that, but they need to be
finished and cleaned up).

> Rename detection: either you deal with the big files each time, or you
> (re)create a cache with that information so no analysis is needed the
> second time around. This is something that even small files might
> possibly benefit from. But in any case, there is no other way but to
> bite the bullet at least initially, and big files will be slower to
> process no matter what.

Right. What I am proposing is basically to create such a cache, but one
that is general enough that it could be used for more than just rename
detection (though arguably rename detection and deltification could share
more of the same techniques, in which case a cache for one would help the
other).

> Looks to me like you wish for git to do what a specialized database
> would be much better suited for. Aren't there tools to gather picture
> metadata, just as iTunes already does with MP3s?

Yes, I already have tools for handling picture metadata. But how do I
version control that information? How do I keep it in sync across
multiple checkouts? How do I handle merging concurrent changes from
multiple sources? How do I keep that metadata connected to the pictures
it describes?

The things I want to do are conceptually no different from what I do with
other files; it is merely the size of the files that makes working with
them in git less convenient (but it does _work_; I am using git for this
_now_, and I have been for a few years).

> But being able to deal with large (1GB and more) files remains a totally
> different problem.

Right, and that is why I think I will end up building on top of what you
do. I am trying to make a way for some operations to avoid looking at the
entire file, even streaming, which should drastically speed up those
operations. But it is unavoidable that some operations (e.g., "git add")
will have to look at the entire file. And that is what your proposal is
about; streaming is basically the only way forward there.

-Peff