The following is a list of sub-problems, according to my understanding of
the "big file support" problem. Can anyone give some feedback and help
refine it? Thanks.

large file
 |-- text file (always delta well? needs to be confirmed)
 `-- binary file
      |-- deltas well (ok)
      `-- does not delta well (improvement?)
           |-- general binary file (without encryption or compression;
           |   other cases which definitely cannot delta well)
           |-- encrypted file (improvement? one straightforward method is
           |   to decrypt the file before delta-ing it; however, we don't
           |   always have the key for decryption. Other?)
           `-- compressed file (improvement? decompress before delta-ing
               it? Other?)

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@xxxxxxxxx> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>> On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen <at> chenirvine.org> wrote:
>> > Hi, everyone. This is Bo Chen. I am interested in the idea of "Better
>> > big-file support".
>> >
>> > As described on the ideas page:
>> > "Many large files (like media) do not delta very well. However, some
>> > do (like VM disk images). Git could split large objects into smaller
>> > chunks, similar to bup, and find deltas between these much more
>> > manageable chunks. There are some preliminary patches in this
>> > direction, but they are in need of review and expansion."
>> >
>> > Can anyone elaborate a little bit on why many large files do not
>> > delta very well?
>>
>> Large files are usually binary. Depending on the type of binary, they
>> may or may not delta well. Those that are compressed/encrypted
>> obviously don't delta well, because one change can make the final
>> result completely different.
>
> I would add that the larger a file, the greater the temptation to use a
> compressed format for it, so large files are often compressed binaries.
>
> For these, a trick to obtain good deltas can be to decompress before
> splitting into chunks with the rsync algorithm. Git filters can already
> be used for this, but it can be tricky to ensure that the
> decompress/recompress round trip re-creates the original compressed file.
>
> Furthermore, some compressed binaries are internally composed of
> multiple streams (think of a zip archive containing multiple files, but
> this is by no means limited to zip). In this case, there are often many
> possible orderings of the streams. If so, the best deltas can be
> obtained by sorting the streams into some 'canonical' order and
> decompressing. Even without decompressing, sorting alone can give good
> results as long as the changes are confined to a single stream of the
> container. Personally, I know of no example of git filters being used
> to perform this sorting, which can be extremely tricky if the file must
> be recoverable with its original stream order.
>
> Maybe (but this is just speculation), once the bup-inspired file
> chunking support is in place, people will start contributing filters to
> improve the management of many types of standard formats (obviously
> 'improve' in terms of space efficiency, as filters can be quite slow).
>
> Sergio
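
P.S. To check my own understanding of Duy's point about compressed
content, here is a tiny throwaway experiment (my own toy code, not from
any patch; the payload and constants are arbitrary). It just measures how
little two zlib-compressed outputs have in common after a one-byte change
in otherwise identical input, which is why delta-ing the compressed bytes
is mostly hopeless:

# Toy experiment: how much do two zlib streams have in common after a
# one-byte change in the (otherwise identical) input?
import zlib

original = b"some fairly repetitive payload " * 4096
modified = bytearray(original)
modified[1000] ^= 0xFF                       # flip a single byte early on

z1 = zlib.compress(original, 6)
z2 = zlib.compress(bytes(modified), 6)

# Length of the identical prefix of the two compressed streams; almost
# everything from (roughly) the point of the change onward differs.
common = 0
for a, b in zip(z1, z2):
    if a != b:
        break
    common += 1

print("compressed sizes:", len(z1), "and", len(z2))
print("identical prefix:", common, "bytes")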
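
And to make sure I understand the bup-style chunking idea from the
proposal, here is a rough sketch of content-defined chunking (again just
my own illustration: a simple rsync-style rolling checksum, not bup's
actual rollsum and not the existing preliminary patches; the window and
mask values are arbitrary). The point is that chunk boundaries are chosen
from the content itself, so inserting a few bytes near the front of a big
file only disturbs the chunks around the insertion, and all later chunks
keep their identity and can be delta'd or deduplicated as before:

# Sketch of content-defined chunking with a rolling checksum.

def chunk_boundaries(data: bytes, window: int = 64, mask: int = (1 << 13) - 1):
    """Yield (start, end) offsets of content-defined chunks of `data`.

    A boundary is declared whenever the low bits of a rolling checksum
    over the last `window` bytes are all ones, giving chunks of roughly
    `mask + 1` bytes on average, independent of absolute file offsets.
    """
    start = 0
    s1 = s2 = 0                              # rsync-style rolling sums
    for i, byte in enumerate(data):
        old = data[i - window] if i >= window else 0
        s1 += byte - old                     # sum of the bytes in the window
        s2 += s1 - window * old              # age-weighted sum
        digest = ((s1 & 0xFFFF) << 16) | (s2 & 0xFFFF)
        boundary = (digest & mask) == mask
        if boundary and i + 1 - start >= window:   # also enforce a minimum size
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)               # whatever is left at the end


if __name__ == "__main__":
    # "disk.img" is just a hypothetical large file for trying this out.
    data = open("disk.img", "rb").read()
    sizes = [end - start for start, end in chunk_boundaries(data)]
    print(len(sizes), "chunks, average size",
          sum(sizes) // max(len(sizes), 1), "bytes")

If each chunk were then stored as its own object (or as a node in a tree
of chunks, as bup does), git's existing delta machinery would only ever
have to deal with chunk-sized pieces.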
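
Finally, Sergio's decompress-before-delta trick can already be prototyped
with a clean/smudge filter pair. Below is a deliberately naive,
hypothetical example for gzip files (the script name and the "gunzip"
driver name are made up). It also runs straight into the round-trip
problem he mentions: re-compressing on checkout is not guaranteed to
reproduce the original bytes (header timestamp, compression level, zlib
build), so this only works where byte-identical round trips are not
required:

#!/usr/bin/env python3
# gunzip-filter.py -- hypothetical clean/smudge filter for *.gz files.
# "clean" stores the decompressed bytes in the repository so they can be
# chunked and delta'd; "smudge" re-compresses on checkout.  NOTE: the
# smudged file is not guaranteed to be byte-identical to the original .gz.
import gzip
import sys

def clean() -> None:
    # Repository side: store decompressed content.
    sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))

def smudge() -> None:
    # Working-tree side: re-compress (mtime=0 keeps the output
    # deterministic, but still not necessarily equal to what was committed).
    sys.stdout.buffer.write(
        gzip.compress(sys.stdin.buffer.read(), compresslevel=9, mtime=0))

if __name__ == "__main__":
    clean() if sys.argv[1:] == ["clean"] else smudge()

Wiring it up would be the usual filter setup: "*.gz filter=gunzip" in
.gitattributes, plus git config filter.gunzip.clean "python3
gunzip-filter.py clean" and filter.gunzip.smudge "python3
gunzip-filter.py smudge" (the driver name is arbitrary).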