Re: Multiblobs

On Apr 28, 2010, at 11:12, Sergio Callegari wrote:

> Hi,
> 
> I happened to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
> whether the idea has been abandoned for some reason or just put on hold.
> 
> Apparently, this would help enormously with
> - storing large binary blobs (the split could happen with a rolling checksum
> approach)
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...
> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.
> - etc...

In the early days of Git I once implemented a "git pipe" command that would
allow an unbounded stream of data to be stored in Git. The stream would be
broken up into small segments using context-sensitive break points (essentially
points in the stream where a hash H of the last N bytes modulo P is equal to
some Q). The average segment length is then about P bytes.
Multiple segments would be put in a tree, with each tree entry's name being the
cumulative length of the segment or subtree it references, with enough leading
zeros to accommodate the largest length in the tree.
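
As a rough illustration of the splitting step, here is a minimal Python
sketch. It is not the original "git pipe" code: the function name
split_stream, the simple additive rolling hash standing in for H, and the
default values of n, p and q are all assumptions chosen for brevity.

```python
def split_stream(data: bytes, n: int = 16, p: int = 64, q: int = 0) -> list[bytes]:
    """Split data wherever H(last n bytes) mod p == q.

    H is a plain sum of the last n bytes (an assumption for this sketch;
    a real implementation would want a stronger rolling hash).
    """
    segments = []
    start = 0        # offset where the current segment begins
    window_sum = 0   # rolling sum over the last n bytes, our stand-in for H
    for i, byte in enumerate(data):
        window_sum += byte
        if i >= n:
            # Drop the byte that just slid out of the n-byte window.
            window_sum -= data[i - n]
        # Only test for a break once the window is full.
        if i + 1 >= n and window_sum % p == q:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        # Whatever follows the last break point becomes the final segment.
        segments.append(data[start:])
    return segments
```

Because each break point depends only on the preceding n bytes, prepending or
editing a few bytes changes only the segments near the edit. Later segments
come out byte-for-byte identical, so they would hash to the same objects.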

This works well and allows efficient diff operations or updates of arbitrarily
large files. In particular, all operations take time proportional to the
size of the change rather than the size of the file.

The drawbacks are:

  - All of the variables H, N, P and Q above influence the final hash
    that is computed for an object, so the values picked must work well.
  - You'd only want to use this method for largish files, but because
    this threshold influences final hashes, it again should be picked with care.
  - It is more complex than having just simple straight blobs.

One of the nice aspects of this representation is that extracting the tree
into the local filesystem and concatenating all files in the directory
tree in alphabetical order does yield the original file.
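
To make the naming scheme concrete, here is a small Python sketch of a single
flat tree level. The helper names name_segments and reassemble are
hypothetical, decimal names are an assumption (the scheme only requires a
fixed width), and segments are assumed to be non-empty so that names stay
unique. Subtrees are left out.

```python
def name_segments(segments: list[bytes]) -> dict[str, bytes]:
    """Name each segment by the cumulative length up to and including it."""
    total = sum(len(seg) for seg in segments)
    width = len(str(total))  # enough digits for the largest length in the tree
    entries = {}
    offset = 0
    for seg in segments:
        offset += len(seg)
        # Zero-padding makes alphabetical order agree with numeric order.
        entries[str(offset).zfill(width)] = seg
    return entries


def reassemble(entries: dict[str, bytes]) -> bytes:
    """Concatenate the entries in alphabetical order of their names."""
    return b"".join(entries[name] for name in sorted(entries))
```

The leading zeros are what make the alphabetical concatenation work: without
them, a name like "7" would sort after "127".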

  -Geert