On Sat, Mar 31, 2012 at 11:18:16AM -0500, Neal Kreitzinger wrote:

> On 3/31/2012 6:02 AM, Sergio Callegari wrote:
> >I wonder if it could make sense to have some pluggable mechanism for
> > file splitting. Something under the lines of filters, so to say.
> >Bupsplit can be a rather general mechanism, but large binaries that
> >are containers (zip, jar, docx, tgz, pdf - seen as a collection of
> >streams) may possibly be more conveniently split by their inherent
> >components.
> >
> gitattributes or gitconfig could configure the big-file handler for
> specified files. Known/supported filetypes like gif, png, zip, pdf,
> etc., could be auto-configured by git. Any
> yet-unknown/yet-unsupported filetypes could be configured manually by
> the user, e.g.
> *.zgp=bigcontainer

This is a tempting route (and one I've even suggested myself before),
but I think ultimately it is a bad way to go. The problem is that
splitting is only half of the equation. Once you have split contents,
you have to use them intelligently, which means looking at the sha1s of
each split chunk and discarding whole chunks as "the same" without even
looking at the contents.

Which means that it is very important that your chunking algorithm
remain stable from version to version. A change in the algorithm is
going to completely negate the benefits of chunking in the first place.
So something configurable, or something that is not applied consistently
(because it depends on each user's git config, or even on the specific
version of a tool used) can end up being no help at all.

Properly applied, I think a content-aware chunking algorithm could
out-perform a generic one. But I think we need to first find out exactly
how well the generic algorithm can perform. It may be "good enough"
compared to the hassle that inconsistent application of a content-aware
algorithm will cause.

So I wouldn't rule it out, but I'd rather try the bup-style splitting
first, and see how good (or bad) it is.

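To make the stability point concrete, here is a rough Python sketch. It
is not bup's rollsum and not anything git implements; the rolling sum,
the `window` and `mask` parameters, and the names `split_chunks` and
`chunk_ids` are all made up for illustration. It splits two versions of
a blob with the same splitter, then splits the second version again with
a different window size, and counts how many chunk sha1s still match.

```python
import hashlib
import random


def split_chunks(data, window=16, mask=0x3F):
    """Toy content-defined splitter (hypothetical, not bup's algorithm).

    Keeps a sum of the last `window` bytes and cuts after any byte where
    the low bits of that sum are all set, so a cut point depends only on
    nearby content, never on the absolute offset in the file.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def chunk_ids(data, **params):
    """Set of sha1 hex digests, one per chunk."""
    return {hashlib.sha1(c).hexdigest() for c in split_chunks(data, **params)}


# Two versions of a "large binary": v2 has a small edit in the middle.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(20000))
v2 = v1[:10000] + b"edit!" + v1[10000:]

# Same splitter on both sides: chunks away from the edit keep their sha1.
same_algo = chunk_ids(v1) & chunk_ids(v2)

# One side uses a different window (think: another config or tool version).
changed_algo = chunk_ids(v1) & chunk_ids(v2, window=32)

print("chunks in v1:             ", len(chunk_ids(v1)))
print("shared, same splitter:    ", len(same_algo))
print("shared, different window: ", len(changed_algo))
```

With the same parameters on both sides, nearly every chunk id from v1
should show up again for v2, so those chunks can be skipped by sha1
alone. Once the window differs, the cut points land in different places
and almost nothing should match, even though the two blobs differ by
only five bytes.
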
-Peff