On Wed, Apr 28, 2010 at 11:12 AM, Sergio Callegari <sergio.callegari@xxxxxxxxx> wrote:

> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...

I'm not sure it would help very much for these sorts of files. The problem is that compressed files tend to change a lot even if only a few bytes of the original data have changed. For things like opendocument, or uncompressed tars, you'd be better off decompressing them (or recompressing with zip -0) using .gitattributes. Generally these files aren't *so* large that they really need to be chunked; what you want to do is improve the deltas, which decompressing will do.

> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.

That sounds complicated and error-prone, and is suspiciously like Apple's "resource forks," which even Apple has mostly realized were a bad idea.

> - help the management of upstream trees. This could be simplified since the
> "pristine tree" distributed as a tar.gz file and the exploded repo could share
> their blobs making commands such as pristine-tree unnecessary.

Sharing the blobs of a tarball with a checked-out tree would require a tar-specific chunking algorithm. Not impossible, but a pain, and you might have a hard time getting it accepted into git since it's obviously not something you really need for a normal "source code" tracking system.

> - help projects such as bup that currently need to provide split mechanisms of
> their own.

Since bup is so awesome that it will soon rule the world of file-splitting backup systems, and bup already has a working implementation, this reason by itself probably isn't enough to integrate the feature into git.

> - be used to add "different representations" to objects... for instance, when
> storing a pdf one could use a fake split to store in a separate blob the
> corresponding text, making the git-diff of pdfs almost instantaneous.

Aie, files that have different content depending on how you look at them? You'll make a lot of enemies with such a patch :)

> From Jeff's post, I guess that the major issue could be that the same file could
> get a different sha1 as a multiblob versus a regular blob, but maybe it could be
> possible to make the multiblob take the same sha1 of the "equivalent plain blob"
> rather than its real hash.

I think that's actually not a very important problem. Files that are different will still always have differing sha1s, which is the important part. Files that are the same might not have the same sha1, which is a bit weird, but it's unlikely that any algorithm in git depends fundamentally on the fact that the sha1s match. Storing files split up does have a lot of value for calculating diffs, however: because you can walk through the tree of hashes and short-circuit entire subtrees with identical sha1s, you can diff even 20GB files really rapidly.

> For the moment, I am just very curious about the idea and the possible pros and
> cons... can someone (maybe Jeff himself) tell me a little more? Also I wonder
> about the two possibilities (implement it in git vs implement it "on top of"
> git).

"on top of" git has one major advantage, which is that it's easy: for example, bup already does it.
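(In case a concrete picture helps: here's a rough, made-up sketch of content-defined splitting in the spirit of what bup does. The rolling hash, window size, and constants below are invented for the example and aren't bup's real algorithm; the point is just that chunk boundaries are decided by nearby bytes, so a small edit early in a file doesn't shift every chunk after it.)

    import hashlib

    WINDOW = 64                       # bytes of context the rolling hash covers
    SPLIT_MASK = 0x1FFF               # split when the low 13 bits are set (~8KB average chunks)
    BASE = 257
    MOD = (1 << 31) - 1
    POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

    def split_chunks(data):
        """Yield (sha1, chunk) pairs; identical stretches of data yield identical chunks."""
        window = bytearray(WINDOW)
        rolling = 0
        start = 0
        for i, byte in enumerate(data):
            out = window[i % WINDOW]
            window[i % WINDOW] = byte
            # Rabin-Karp style rolling hash over the last WINDOW bytes
            rolling = ((rolling - out * POW) * BASE + byte) % MOD
            if i + 1 >= WINDOW and (rolling & SPLIT_MASK) == SPLIT_MASK:
                chunk = bytes(data[start:i + 1])
                yield hashlib.sha1(chunk).hexdigest(), chunk
                start = i + 1
        if start < len(data):
            chunk = bytes(data[start:])
            yield hashlib.sha1(chunk).hexdigest(), chunk

Fixed-size chunks wouldn't have that property: inserting a single byte would shift every later boundary and change the hash of every chunk after it.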
The disadvantage of the "on top of" approach is that checking out the resulting repository isn't smart enough to re-join the data, so you end up with a bunch of tiny chunk files you have to concatenate by hand.

Implementing inside git could be done in one of two ways: add support for a new 'multiblob' data type (which is really more like a tree object, but gets checked out as a single file), or implement chunking at the packfile level, so that higher-level tools never have to know about multiblobs. The latter would probably be easier and more backward-compatible, but you'd probably lose the ability to do really fast diffs between multiblobs, since diff happens at the higher level.

Overall, I'm not sure git would benefit much from supporting large files in this way; at least not yet. As soon as you supported this, you'd start running into other problems... such as the fact that shallow repos don't really work very well, and you obviously don't want to clone every single copy of a 100MB file just so you can edit the most recent version. So you might want to make sure shallow repos / sparse checkouts are fully up to speed first.

Have fun,

Avery
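P.S. To make the "walk through the tree of hashes" point concrete, here's a made-up sketch; the Node layout and function are invented for illustration and aren't anything in git or bup. If a multiblob is stored as a tree of chunk hashes, diffing two versions is cheap because any subtree whose hash already matches can be skipped wholesale:

    class Node:
        def __init__(self, sha1, children=None, data=None):
            self.sha1 = sha1          # hash of this subtree (or of a leaf chunk)
            self.children = children  # list of child Nodes, or None for a leaf chunk
            self.data = data          # raw bytes, only for leaf chunks

    def changed_chunks(a, b):
        """Yield (old_leaf, new_leaf) pairs that differ between two chunk trees."""
        if a.sha1 == b.sha1:
            return                    # identical subtree: skip it entirely
        if a.children is None or b.children is None:
            yield a, b                # reached leaf chunks that differ
            return
        for ca, cb in zip(a.children, b.children):
            yield from changed_chunks(ca, cb)
        # (a real version would also have to handle trees of different fanout/length)

Only the subtrees on the path down to a changed chunk ever get visited, which is why a huge file with a few edited bytes can be diffed almost instantly. That's the ability you'd give up if chunking happened only down at the packfile level.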