Re: GSoC - Some questions on the idea of

Sergio Callegari <sergio.callegari@xxxxxxxxx> · Tue, 03 Apr 2012 11:58:58 +0200

On 02/04/2012 23:07, Jeff King wrote:
>> gitattributes or gitconfig could configure the big-file handler for
>> specified files.  Known/supported filetypes like gif, png, zip, pdf,
>> etc., could be auto-configured by git.  Any
>> yet-unknown/yet-unsupported filetypes could be configured manually by
>> the user, e.g.
>> *.zgp=bigcontainer
>
> This is a tempting route (and one I've even suggested myself before),
> but I think ultimately it is a bad way to go. The problem is that
> splitting is only half of the equation. Once you have split contents,
> you have to use them intelligently, which means looking at the sha1s of
> each split chunk and discarding whole chunks as "the same" without even
> looking at the contents.
>
> Which means that it is very important that your chunking algorithm
> remain stable from version to version. A change in the algorithm is
> going to completely negate the benefits of chunking in the first place.
> So something configurable, or something that is not applied consistently
> (because it depends on each user's git config, or even on the specific
> version of a tool used) can end up being no help at all.

Isn't this the same with filters? The clean algorithms should also
remain stable from version to version. Filters are often perceived as
simpler, so this stability seems easier to achieve, but that is not
necessarily the case.
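
The stability concern can be illustrated with a minimal sketch. It uses
plain fixed-size chunking as a stand-in (not bupsplit's rolling
checksum), and `chunk_ids` is a hypothetical helper, not anything git
provides. With unchanged parameters, appending data leaves the earlier
chunk ids reusable; changing the chunk size leaves none in common.

```python
import hashlib

def chunk_ids(data: bytes, chunk_size: int) -> set:
    """Return the set of SHA-1 ids of the fixed-size chunks of data."""
    return {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }

# 1280 bytes of deterministic pseudo-random sample data (64 SHA-1 digests).
data = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(64))

# Same algorithm and parameters, file grown by a few bytes.
same_algo = chunk_ids(data, 64) & chunk_ids(data + b"tail", 64)

# Same file, but the chunk size changed between "versions" of the splitter.
changed_algo = chunk_ids(data, 64) & chunk_ids(data, 48)

print(len(same_algo), "chunks reused with the same parameters")
print(len(changed_algo), "chunks reused after changing the chunk size")
```

The same reasoning would apply to any splitter or clean filter whose
output depends on a per-user setting or on the tool version.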

> Properly applied, I think a content-aware chunking algorithm could
> out-perform a generic one. But I think we need to first find out exactly
> how well the generic algorithm can perform. It may be "good enough"
> compared to the hassle that inconsistent application of a content-aware
> algorithm will cause.

Absolutely true, but why not give the user the freedom to choose? Git
could provide the bupsplit mechanism and at the same time offer a means
for the user to plug in different machinery for specific file types.
In this case, it is the user's responsibility to do it right.

One could have a special 'filter' for splitting/unsplitting. Say

[splitfilter "XXX"]
    split = xxx
    unsplit = uxxx

xxx receives the file to split on stdin and writes to stdout a stream
made of an index header followed by the concatenation of the parts into
which the file should be split. For unsplitting, uxxx receives the index
and the concatenated parts on stdin and writes the binary file to stdout.
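
A minimal sketch of such a split/unsplit pair follows. Everything
concrete in it is an assumption made for illustration: the header layout
(one JSON line of [name, start, length] entries), the function names,
and the use of fixed-size chunking in place of bupsplit. No index header
format has been defined by git.

```python
import json

def split_stream(data: bytes, chunk_size: int = 4) -> bytes:
    """Build a hypothetical 'index header + concatenated parts' stream."""
    parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    index = []
    start = 0
    for n, part in enumerate(parts):
        # Flat sequence-number names; start is relative to the body.
        index.append([f"{n:04d}", start, len(part)])
        start += len(part)
    # json.dumps emits a single line, so the first newline ends the header.
    header = json.dumps(index).encode("ascii") + b"\n"
    return header + b"".join(parts)

def parse_stream(stream: bytes) -> list:
    """Return [(name, part_bytes), ...] as described by the index header."""
    header, _, body = stream.partition(b"\n")
    return [
        (name, body[start:start + length])
        for name, start, length in json.loads(header)
    ]

def unsplit_stream(stream: bytes) -> bytes:
    """Reassemble the original file from the stream."""
    return b"".join(part for _, part in parse_stream(stream))

original = b"hello world"
stream = split_stream(original)
assert unsplit_stream(stream) == original
```

Real filters would read `sys.stdin.buffer` and write `sys.stdout.buffer`
so that they can be invoked as the xxx and uxxx commands.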

bupsplit and bupunsplit could be built in, with other tools being
user-provided.  If users get them wrong, it is ultimately their
responsibility. After all, the user is even given 'rm', isn't he/she?
Git could provide a header file defining the index header format to help
with coding alternative, more specific splitters. If people devise some
that look promising, they could be collected in contrib.

Possibly, the index header could contain not only the starting positions
of the various parts in the stream, but also 'names' for them. This would
allow reusing blob and tree objects to physically store the various
parts. For bupsplit, names could be flat (e.g. sequence numbers like
0000, 0001). For files that are containers, they could reflect the inner
names. Looking ahead, one could even devise specific diff tools for these
'special' trees of split-object components. With these, when storing,
say, a very large zip file in git, such tools could report things like
'from version x to version y, only that specific part in the zip file
has changed'.
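
As a rough sketch of what such a diff tool might report, the following
compares two versions of a zip file by the names and hashes of their
members. The function names are hypothetical, and it hashes member
contents with plain SHA-1 rather than computing git blob ids.

```python
import hashlib
import io
import zipfile

def part_hashes(zip_bytes: bytes) -> dict:
    """Map each member name to the SHA-1 of its uncompressed content."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            name: hashlib.sha1(zf.read(name)).hexdigest()
            for name in zf.namelist()
        }

def changed_parts(old: bytes, new: bytes) -> dict:
    """Report which named parts were added, removed, or modified."""
    a, b = part_hashes(old), part_hashes(new)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "modified": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }

def make_zip(members: dict) -> bytes:
    """Helper for the example: build an in-memory zip from name -> bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()

version_x = make_zip({"a.txt": b"one", "b.txt": b"two"})
version_y = make_zip({"a.txt": b"one", "b.txt": b"TWO", "c.txt": b"new"})
print(changed_parts(version_x, version_y))
```

If the parts were stored as named entries of a tree object, the same
comparison could be made on the stored ids without reading the contents.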

Sergio