The following is a list of sub-problems, according to my understanding of
the "big file support" problem. Can anyone give some feedback and help
refine it? Thanks.

large file
 |-- text file (always delta well? needs to be confirmed)
 `-- binary file
      |-- deltas well (ok)
      `-- does not delta well (improvement?)
           |-- general binary file (without encryption or compression;
           |   other cases which definitely cannot delta well)
           |-- encrypted file (improvement? one straightforward method is
           |   to decrypt the file before delta-ing it; however, we don't
           |   always have the key for decryption. Other?)
           `-- compressed file (improvement? decompress before delta-ing
               it? Other?)

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@xxxxxxxxx> wrote:
> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
>
>> On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen <at> chenirvine.org> wrote:
>> > Hi, everyone. This is Bo Chen. I am interested in the idea of "Better
>> > big-file support".
>> >
>> > As described on the ideas page:
>> > "Many large files (like media) do not delta very well. However, some
>> > do (like VM disk images). Git could split large objects into smaller
>> > chunks, similar to bup, and find deltas between these much more
>> > manageable chunks. There are some preliminary patches in this
>> > direction, but they are in need of review and expansion."
>> >
>> > Can anyone elaborate a little bit on why many large files do not
>> > delta very well?
>>
>> Large files are usually binary. Depending on the type of binary, they
>> may or may not delta well. Those that are compressed/encrypted
>> obviously don't delta well, because one change can make the final
>> result completely different.
>
> I would add that the larger a file, the greater the temptation to use a
> compressed format for it, so large files are often compressed binaries.
>
> For these, a trick to obtain good deltas can be to decompress before
> splitting into chunks with the rsync algorithm. Git filters can already
> be used for this, but it can be tricky to ensure that the
> decompress/recompress round trip re-creates the original compressed file.
>
> Furthermore, some compressed binaries are internally composed of
> multiple streams (think of a zip archive containing multiple files, but
> this is by no means limited to zip). In this case, there are often many
> possible orderings of the streams. If so, the best deltas can be
> obtained by sorting the streams into some 'canonical' order and
> decompressing. Even without decompressing, sorting alone can give good
> results as long as the changes are confined to a single stream of the
> container. Personally, I know of no example of git filters being used
> to perform this sorting, which can be extremely tricky if the file must
> be recoverable with its original stream order.
>
> Maybe (but this is just speculation), once the bup-inspired file
> chunking support is in place, people will start contributing filters to
> improve the management of many types of standard formats (obviously
> 'improve' in terms of space efficiency, as filters can be quite slow).
>
> Sergio
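
P.S. To check my own understanding of Duy's point about compressed
content, here is a tiny throwaway experiment (my own toy code, not from
any patch; the payload and constants are arbitrary). It just measures how
little two zlib-compressed outputs have in common after a one-byte change
in otherwise identical input, which is why delta-ing the compressed bytes
is mostly hopeless:

# Toy experiment: how much do two zlib streams have in common after a
# one-byte change in the (otherwise identical) input?
import zlib

original = b"some fairly repetitive payload " * 4096
modified = bytearray(original)
modified[1000] ^= 0xFF                       # flip a single byte early on

z1 = zlib.compress(original, 6)
z2 = zlib.compress(bytes(modified), 6)

# Length of the identical prefix of the two compressed streams; almost
# everything from (roughly) the point of the change onward differs.
common = 0
for a, b in zip(z1, z2):
    if a != b:
        break
    common += 1

print("compressed sizes:", len(z1), "and", len(z2))
print("identical prefix:", common, "bytes")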
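
And to make sure I understand the bup-style chunking idea from the
proposal, here is a rough sketch of content-defined chunking (again just
my own illustration: a simple rsync-style rolling checksum, not bup's
actual rollsum and not the existing preliminary patches; the window and
mask values are arbitrary). The point is that chunk boundaries are chosen
from the content itself, so inserting a few bytes near the front of a big
file only disturbs the chunks around the insertion, and all later chunks
keep their identity and can be delta'd or deduplicated as before:

# Sketch of content-defined chunking with a rolling checksum.

def chunk_boundaries(data: bytes, window: int = 64, mask: int = (1 << 13) - 1):
    """Yield (start, end) offsets of content-defined chunks of `data`.

    A boundary is declared whenever the low bits of a rolling checksum
    over the last `window` bytes are all ones, giving chunks of roughly
    `mask + 1` bytes on average, independent of absolute file offsets.
    """
    start = 0
    s1 = s2 = 0                              # rsync-style rolling sums
    for i, byte in enumerate(data):
        old = data[i - window] if i >= window else 0
        s1 += byte - old                     # sum of the bytes in the window
        s2 += s1 - window * old              # age-weighted sum
        digest = ((s1 & 0xFFFF) << 16) | (s2 & 0xFFFF)
        boundary = (digest & mask) == mask
        if boundary and i + 1 - start >= window:   # also enforce a minimum size
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)               # whatever is left at the end


if __name__ == "__main__":
    # "disk.img" is just a hypothetical large file for trying this out.
    data = open("disk.img", "rb").read()
    sizes = [end - start for start, end in chunk_boundaries(data)]
    print(len(sizes), "chunks, average size",
          sum(sizes) // max(len(sizes), 1), "bytes")

If each chunk were then stored as its own object (or as a node in a tree
of chunks, as bup does), git's existing delta machinery would only ever
have to deal with chunk-sized pieces.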
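
Finally, Sergio's decompress-before-delta trick can already be prototyped
with a clean/smudge filter pair. Below is a deliberately naive,
hypothetical example for gzip files (the script name and the "gunzip"
driver name are made up). It also runs straight into the round-trip
problem he mentions: re-compressing on checkout is not guaranteed to
reproduce the original bytes (header timestamp, compression level, zlib
build), so this only works where byte-identical round trips are not
required:

#!/usr/bin/env python3
# gunzip-filter.py -- hypothetical clean/smudge filter for *.gz files.
# "clean" stores the decompressed bytes in the repository so they can be
# chunked and delta'd; "smudge" re-compresses on checkout.  NOTE: the
# smudged file is not guaranteed to be byte-identical to the original .gz.
import gzip
import sys

def clean() -> None:
    # Repository side: store decompressed content.
    sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))

def smudge() -> None:
    # Working-tree side: re-compress (mtime=0 keeps the output
    # deterministic, but still not necessarily equal to what was committed).
    sys.stdout.buffer.write(
        gzip.compress(sys.stdin.buffer.read(), compresslevel=9, mtime=0))

if __name__ == "__main__":
    clean() if sys.argv[1:] == ["clean"] else smudge()

Wiring it up would be the usual filter setup: "*.gz filter=gunzip" in
.gitattributes, plus git config filter.gunzip.clean "python3
gunzip-filter.py clean" and filter.gunzip.smudge "python3
gunzip-filter.py smudge" (the driver name is arbitrary).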