Re: GSoC - Some questions on the idea of

On Sat, Mar 31, 2012 at 11:18:16AM -0500, Neal Kreitzinger wrote:

> On 3/31/2012 6:02 AM, Sergio Callegari wrote:
> >I wonder if it could make sense to have some pluggable mechanism for
> > file splitting. Something along the lines of filters, so to say.
> >Bupsplit can be a rather general mechanism, but large binaries that
> >are containers (zip, jar, docx, tgz, pdf - seen as a collection of
> >streams) may possibly be more conveniently split by their inherent
> >components.
> >
> 
> gitattributes or gitconfig could configure the big-file handler for
> specified files.  Known/supported filetypes like gif, png, zip, pdf,
> etc., could be auto-configured by git.  Any
> yet-unknown/yet-unsupported filetypes could be configured manually by
> the user, e.g.
> *.zgp=bigcontainer

This is a tempting route (and one I've even suggested myself before),
but I think ultimately it is a bad way to go. The problem is that
splitting is only half of the equation. Once you have split contents,
you have to use them intelligently, which means looking at the sha1s of
each split chunk and discarding whole chunks as "the same" without even
looking at the contents.
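
To make the "use them intelligently" part concrete, here is a minimal
sketch of comparing two versions of a file purely by chunk sha1s. It is
not bup's actual algorithm: the rolling sum, the WINDOW and MASK values,
and the names split_chunks and chunk_ids are all assumptions made up for
illustration.

```python
import hashlib
import random

WINDOW = 16   # bytes in the rolling window (illustrative value, an assumption)
MASK = 0x3F   # split when the low 6 bits of the rolling sum are all set

def split_chunks(data, window=WINDOW, mask=MASK):
    """Split data at content-defined boundaries using a simple rolling sum."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # drop the byte that left the window
        # Require a minimum chunk length, then split on the bit pattern.
        if i + 1 - start >= window and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks

def chunk_ids(chunks):
    """Identify each chunk by the sha1 of its contents."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}

# Hypothetical test data: a random blob, then the same blob with a small insert.
rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(20000))
new = old[:5000] + b"a few inserted bytes" + old[5000:]

old_chunks = split_chunks(old)
new_chunks = split_chunks(new)
known = chunk_ids(old_chunks)

# Chunks whose sha1 is already known are skipped without reading their contents.
unseen = [c for c in new_chunks if hashlib.sha1(c).hexdigest() not in known]
print(len(new_chunks), "chunks in new version,", len(unseen), "not already stored")
```

Because the boundaries depend only on nearby bytes, the chunks after the
insertion line up again with the old ones, and only the few chunks around
the edit need to be stored or examined.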

Which means that it is very important that your chunking algorithm
remain stable from version to version. A change in the algorithm is
going to completely negate the benefits of chunking in the first place.
So something configurable, or something that is not applied consistently
(because it depends on each user's git config, or even on the specific
version of a tool used) can end up being no help at all.
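
A rough sketch of that failure mode, under made-up assumptions: two users
chunk the very same blob, but one of them has a different window size
configured. The rolling-sum chunker, the window values, and the names
chunk_ids, shared_fraction, alice and bob_* are hypothetical and only
stand in for "same algorithm" versus "slightly different algorithm".

```python
import hashlib
import random

def chunk_ids(data, window, mask=0x3F):
    """Return the set of sha1 ids of content-defined chunks of data."""
    ids = set()
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # keep the sum over the last `window` bytes
        if i + 1 - start >= window and (rolling & mask) == mask:
            ids.add(hashlib.sha1(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        ids.add(hashlib.sha1(data[start:]).hexdigest())
    return ids

def shared_fraction(a, b):
    """Fraction of the chunk ids in a that also appear in b."""
    return len(a & b) / len(a)

rng = random.Random(1)
blob = bytes(rng.getrandbits(8) for _ in range(20000))

alice = chunk_ids(blob, window=16)       # one user's settings
bob_same = chunk_ids(blob, window=16)    # identical settings
bob_other = chunk_ids(blob, window=24)   # a differently configured chunker

print("same settings:     ", shared_fraction(alice, bob_same))
print("different settings:", shared_fraction(alice, bob_other))
```

With identical settings every chunk id matches. With a different window the
boundaries land in different places, so almost no chunk ids coincide even
though the underlying bytes are exactly the same.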

Properly applied, I think a content-aware chunking algorithm could
out-perform a generic one. But I think we need to first find out exactly
how well the generic algorithm can perform. It may be "good enough"
compared to the hassle that inconsistent application of a content-aware
algorithm will cause.  So I wouldn't rule it out, but I'd rather try the
bup-style splitting first, and see how good (or bad) it is.

-Peff