On Apr 28, 2010, at 11:12, Sergio Callegari wrote:

> Hi,
>
> it happened to me to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wandering
> whether the idea has been abandoned for some reason or just put on hold.
>
> Apparently, this would marvellously help on
> - storing large binary blobs (the split could happen with a rolling checksum
> approach)
> - storing "structured files", such as the many zip-based file formats
> (Opendocument, Docx, Jar files, zip files themselves), tars (including
> compressed tars), pdfs, etc, whose number is rising day after day...
> - storing binary files with textual tags, where the tags could go on a separate
> blob, greatly simplifying their readout without any need for caching them on a
> note tree.
> - etc...

In the early days of GIT I once implemented a "git pipe" command that
would allow an unbounded stream of data to be stored in GIT. The stream
would be broken up into small segments using context-sensitive break
points (essentially points in the stream where a hash H of the last N
bytes modulo P is equal to some Q). The average segment will then be
about P bytes long.

Multiple segments would be put in a tree, with each tree entry's name
being the cumulative length of the segment or subtree it references,
with enough leading zeros to accommodate the largest length in the tree.

This works well and allows efficient diff operations or updates of
arbitrarily large files. In particular, all operations take time
proportional to the size of the change rather than the size of the file.

The drawbacks are:
- All of the variables H, N, P and Q above influence the final hash
  that is computed for an object, so the values picked must work well.
- You'd only want to use this method for largish files, but because
  this threshold influences final hashes, it again should be picked
  with care.
- It is more complex than having just simple straight blobs.

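To make the segmentation and naming scheme above concrete, here is a
minimal Python sketch. It is not the original "git pipe" code: the
polynomial rolling hash, the constants WINDOW (N), MODULUS (P), TARGET
(Q) and BASE, and the function names are all assumptions chosen for
illustration. It also builds only a single flat level of entries
rather than nested subtrees, and it assumes that an entry's name is the
cumulative byte count up to the end of that segment.

```python
# Illustrative sketch only; all constants and names are assumptions.
WINDOW = 16            # N: number of trailing bytes covered by the hash
MODULUS = 64           # P: segments average roughly P bytes
TARGET = 0             # Q: cut wherever H % P == Q
BASE = 263             # multiplier of the polynomial rolling hash H
MASK = (1 << 32) - 1   # keep the hash in 32 bits


def split(data: bytes) -> list[bytes]:
    """Cut data after each position where the hash of the last WINDOW bytes hits TARGET."""
    segments = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, 1 << 32)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the contribution of the byte that falls out of the window.
            h = (h - data[i - WINDOW] * top) & MASK
        h = (h * BASE + byte) & MASK
        # Only test for a break point once a full window has been hashed.
        if i + 1 >= WINDOW and h % MODULUS == TARGET:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing bytes after the last break point
    return segments


def name_segments(segments: list[bytes]) -> dict[str, bytes]:
    """Name each segment by its cumulative end offset, zero-padded to a common width."""
    total = sum(len(s) for s in segments)
    width = len(str(total))  # enough digits for the largest offset
    entries = {}
    offset = 0
    for seg in segments:
        offset += len(seg)
        entries[str(offset).zfill(width)] = seg
    return entries


def reassemble(entries: dict[str, bytes]) -> bytes:
    """Concatenate the segments in alphabetical order of their names."""
    return b"".join(entries[name] for name in sorted(entries))
```

Because the hash state depends only on the last WINDOW bytes and is not
reset at a cut, a local edit can only move break points near the edit;
segments further away keep their content. The zero padding is what
makes alphabetical order of the names coincide with numeric order of
the offsets.
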
One of the nice aspects of this representation is that extracting the
tree into the local filesystem and concatenating all files in the
directory tree in alphabetical order yields the original file.

-Geert