Re: Multiblobs

On Wed, Apr 28, 2010 at 11:12 AM, Sergio Callegari
<sergio.callegari@xxxxxxxxx> wrote:
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...

I'm not sure it would help very much for these sorts of files.  The
problem is that compressed files tend to change a lot even if only a
few bytes of the original data have changed.

For things like opendocument, or uncompressed tars, you'd be better
off to decompress them (or recompress with zip -0) using
.gitattributes.  Generally these files aren't *so* large that they
really need to be chunked; what you want to do is improve the deltas,
which decompressing will do.
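Concretely, that's a clean/smudge filter pair. Something along these lines — note the `rezip` helper here is hypothetical; you'd have to supply a small tool that rewrites a zip archive at a given compression level:

```shell
# In .gitattributes:
#   *.odt  filter=zipdoc
#
# Then register the filter ("rezip" is a hypothetical helper that
# repacks a zip archive from stdin to stdout at the given level):
git config filter.zipdoc.clean  "rezip -0"   # store uncompressed: deltas work
git config filter.zipdoc.smudge "rezip -6"   # recompress on checkout (optional)
```

The win is all on the "clean" side: once the stored blob is uncompressed, small edits to the document produce small deltas.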

> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.

That sounds complicated and error prone, and is suspiciously like
Apple's "resource forks," which even Apple has mostly realized were a
bad idea.

> - help the management of upstream trees. This could be simplified since the
> "pristine tree" distributed as a tar.gz file and the exploded repo could share
> their blobs making commands such as pristine-tree unnecessary.

Sharing the blobs of a tarball with a checked-out tree would require a
tar-specific chunking algorithm.  Not impossible, but a pain, and you
might have a hard time getting it accepted into git since it's
obviously not something you really need for a normal "source code"
tracking system.

> - help projects such as bup that currently need to provide split mechanisms of
> their own.

Since bup is so awesome that it will soon rule the world of
file-splitting backup systems, and bup already has a working
implementation, this reason by itself probably isn't enough to justify
integrating the feature into git.
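(For the curious: bup's splitting boils down to a rolling checksum over a sliding window, with a chunk boundary declared wherever the checksum's low bits are all ones. A toy sketch of the idea — the checksum, window size, and mask here are simplified stand-ins for bup's real parameters:)

```python
def chunks(data, mask=0x1FFF, window=64):
    """Split bytes at content-defined boundaries.

    A boundary is declared wherever the low bits of a simple rolling
    sum over the last `window` bytes are all ones.  Boundaries depend
    only on local content, so inserting bytes early in a file perturbs
    nearby chunks but leaves later chunk boundaries (and hence their
    sha1s) unchanged -- which is what makes the storage incremental.
    """
    out, start, acc = [], 0, 0
    for i, b in enumerate(data):
        acc += b
        if i >= window:
            acc -= data[i - window]  # slide the window forward
        if acc & mask == mask:       # low bits all ones -> chunk boundary
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])     # trailing partial chunk
    return out
```

The important property is that the chunks always reassemble to the original file, while a prefix change only disturbs the chunks near the front.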

> - be used to add "different representations" to objects... for instance, when
> storing a pdf one could use a fake split to store in a separate blob the
> corresponding text, making the git-diff of pdfs almost instantaneous.

Aie, files that have different content depending on how you look at
them?  You'll make a lot of enemies with such a patch :)

> From Jeff's post, I guess that the major issue could be that the same file could
> get a different sha1 as a multiblob versus a regular blob, but maybe it could be
> possible to make the multiblob take the same sha1 of the "equivalent plain blob"
> rather than its real hash.

I think that's actually not a very important problem.  Files that are
different will still always have differing sha1s, which is the
important part.  Files that are the same might not have the same sha1,
which is a bit weird, but it's unlikely that any algorithm in git
depends fundamentally on the fact that the sha1s match.

Storing files in split form does have a lot of usefulness for
calculating diffs, however: because you can walk through the tree of
hashes and short-circuit entire subtrees with identical sha1s, you can
diff even 20GB files really rapidly.
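A sketch of that short-circuiting, assuming a toy two-level hash tree (real trees would be deeper and use actual git object ids; `build_tree` and its fanout are made up for illustration):

```python
import hashlib

def h(*parts):
    """sha1 of concatenated byte strings (stand-in for git's object hash)."""
    s = hashlib.sha1()
    for p in parts:
        s.update(p)
    return s.hexdigest()

def build_tree(chunks, fanout=4):
    """Two-level hash tree: leaves are chunk hashes, inner nodes cover
    `fanout` leaves each (a toy stand-in for a multiblob object)."""
    leaves = [h(c) for c in chunks]
    nodes = []
    for i in range(0, len(leaves), fanout):
        group = leaves[i:i + fanout]
        nodes.append((h(*(g.encode() for g in group)), group))
    return nodes

def changed_leaves(a, b, fanout=4):
    """Indices of differing chunks.  Subtrees with identical inner-node
    hashes are skipped wholesale, without ever reading their chunks --
    which is why a diff of a huge split file can be so fast."""
    diffs = []
    for i, (na, nb) in enumerate(zip(a, b)):
        if na[0] == nb[0]:           # whole subtree identical: short-circuit
            continue
        for j, (la, lb) in enumerate(zip(na[1], nb[1])):
            if la != lb:
                diffs.append(i * fanout + j)
    return diffs
```

Only the subtrees whose hashes disagree get descended into, so the cost scales with the size of the change, not the size of the file.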

> For the moment, I am just very curious about the idea and the possible pros and
> cons... can someone (maybe Jeff himself) tell me a little more? Also I wonder
> about the two possibilities (implement it in git vs implement it "on top of"
> git).

"on top of" git has one major advantage, which is that it's easy: for
example, bup already does it.  The disadvantage is that checking out
the resulting repository won't be smart enough to re-merge the data
again, so you have a bunch of tiny chunk files you have to concatenate
by hand.

Implementing inside git could be done in one of two ways: add support
for a new 'multiblob' data type (which is really more like a tree
object, but gets checked out as a single file), or implement chunking
at the packfile level, so that higher-level tools never have to know
about multiblobs.

The latter would probably be easier and more backward-compatible, but
you'd probably lose the ability to do really fast diffs between
multiblobs, since diff happens at the higher level.

Overall, I'm not sure git would benefit much from supporting large
files in this way; at least not yet.  As soon as you supported this,
you'd start running into other problems... such as the fact that
shallow repos don't really work very well, and you obviously don't
want to clone every single copy of a 100MB file just so you can edit
the most recent version.  So you might want to make sure shallow repos
/ sparse checkouts are fully up to speed first.

Have fun,

Avery
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
