Re: GSoC - Some questions on the idea of "Better big-file support".

Sorry for the late reply.

My questions are inline below.


On Wed, Mar 28, 2012 at 2:19 AM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen@xxxxxxxxxxxxxx> wrote:
>> Hi, Everyone. This is Bo Chen. I am interested in the idea of "Better
>> big-file support".
>>
>> As it is described in the idea page,
>> "Many large files (like media) do not delta very well. However, some
>> do (like VM disk images). Git could split large objects into smaller
>> chunks, similar to bup, and find deltas between these much more
>> manageable chunks. There are some preliminary patches in this
>> direction, but they are in need of review and expansion."
>>
>> Can anyone elaborate a little bit why many large files do not delta
>> very well?
>
> Large files are usually binary. Depends on the type of binary, they
> may or may not delta well. Those that are compressed/encrypted
> obviously don't delta well because one change can make the final
> result completely different.

Just to clear up one point of confusion: the delta operation finds the
differences between different versions of the same file, right? As far
as I know, delta encoding re-encodes a file based on the differences
between neighboring blocks, which helps compression because the
encoded file then contains more similar data. Can anyone elaborate a
little on the relation between the delta operation in git and the
delta encoding described above? Thanks.
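
To make the two notions I am asking about concrete, here is a minimal
sketch in Python. It is only my own illustration, not git's code: the
function names (delta_encode, version_delta, apply_delta) are
hypothetical, and difflib is used as a stand-in for whatever matching
git actually does. The first pair of functions is delta encoding
within one sequence; the second pair describes a new version as
copy/insert operations against an old version.

```python
import difflib


def delta_encode(values):
    """Within one sequence: store each value as the difference from its predecessor."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out


def version_delta(old: bytes, new: bytes):
    """Between two versions: describe `new` as copies from `old` plus literal inserts."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))   # reuse bytes already in `old`
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # bytes that only exist in `new`
        # "delete" needs no operation: the bytes are simply not copied
    return ops


def apply_delta(old: bytes, ops):
    """Rebuild the new version from the old version and the operation list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

If my understanding is right, the second form is the one that matters
for storing versions: the delta is small only when the new version
shares long runs of bytes with the old one.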

>
> Another problem with delta-ing large files with git is, current code
> needs to load two files in memory for delta. Consuming 4G for delta 2
> 2GB files does not sound good.


I am wondering why we cannot divide the two 2GB files into chunks and
delta them chunk by chunk. Would that make any difference, apart from
a little more I/O?
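
What I have in mind is roughly the following sketch (Python, my own
illustration; chunked_compare and the tiny chunk size are hypothetical
and chosen only to keep the example readable). It reads both files in
fixed-size chunks, so only one chunk from each file is in memory at a
time, and reports which chunks differ.

```python
import io


def chunked_compare(old, new, chunk_size=4):
    """Yield (offset, is_same) per fixed-size chunk of two binary streams.

    Only one chunk from each stream is held in memory at a time.
    """
    offset = 0
    while True:
        a = old.read(chunk_size)
        b = new.read(chunk_size)
        if not a and not b:
            break  # both streams exhausted
        yield offset, a == b
        offset += chunk_size


# A one-byte change in place only affects the chunk that contains it.
in_place = list(chunked_compare(io.BytesIO(b"aaaabbbbcccc"),
                                io.BytesIO(b"aaaaXbbbcccc")))

# A one-byte insertion at the start shifts every later chunk boundary.
shifted = list(chunked_compare(io.BytesIO(b"aaaabbbbcccc"),
                               io.BytesIO(b"Xaaaabbbbcccc")))
```

One thing the sketch makes visible is that with fixed offsets a single
inserted byte makes every following chunk look different, so plain
fixed-size chunking may not be enough on its own.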

>
>> Is it a general problem or a specific problem just for Git?
>> I am really new to Git, can anyone give me some hints on which source
>> codes I should read to learn more about the current code on delta
>> operation? It is said that "there are some preliminary patches in this
>> direction", where can I find these patches?
>
> Read about rsync algorithm [2]. Bup [1] implements the same (I think)
> algorithm, but on top of git. For preliminary patches, have a look at
> jc/split-blob series at commit 4a1242d in git.git.

To clear up another point of confusion: as I understand it, a file
that has been updated (added, deleted, or modified) is first
delta-compressed and then synchronized to the remote repo by some
mechanism (rsync?). I am wondering what the relationship between the
delta operation and rsync is.

>
> [1] https://github.com/apenwarr/bup
> [2] http://en.wikipedia.org/wiki/Rsync#Algorithm
> --
> Duy

Bo

