Re: GSoC - Some questions on the idea of

Thank you for the quick reply.

My comments are inline below.

On Fri, Mar 30, 2012 at 4:34 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
>> The sub-problems of the "delta for large files" problem:
>>
>> 1 large file
>>
>> 1.1 text files (do they always delta well? needs to be confirmed)
>
> They often do, but text files don't tend to be large. There are some
> exceptions (e.g., genetic data is often kept in line-oriented text
> files, but is very large).
>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.

It seems we should first provide some mechanism that can
distinguish delta-friendly objects from non-delta-friendly ones.
I am wondering whether such an algorithm already exists or still
needs to be devised.
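
Just thinking aloud: one naive heuristic could be to sample a
prefix of the blob and see how well it deflates, since data that is
already compressed (movie files, jpeg, zip) usually deltas poorly
too. Below is purely my own sketch, not existing git code:

    /* Naive sketch (my idea, not existing git code): treat a blob
     * as delta-friendly if a sampled prefix still deflates well.
     * Already-compressed data (video, jpeg, zip) barely shrinks
     * and usually does not delta well either. */
    #include <stdlib.h>
    #include <zlib.h>               /* link with -lz */

    #define SAMPLE (256 * 1024)     /* bytes to sample from the blob */

    static int looks_delta_friendly(const unsigned char *buf, size_t len)
    {
        uLong srclen = len < SAMPLE ? len : SAMPLE;
        uLongf destlen = compressBound(srclen);
        unsigned char *dest = malloc(destlen);
        int friendly = 0;

        if (dest && compress2(dest, &destlen, buf, srclen, 1) == Z_OK)
            /* saved less than ~10%? assume already compressed */
            friendly = destlen < srclen - srclen / 10;
        free(dest);
        return friendly;
    }

I also notice that users can already suppress delta attempts per
path with the delta attribute in .gitattributes (e.g. "*.mp4 -delta"),
so perhaps an automatic heuristic is only needed for paths where no
attribute is set.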



>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?

Ah, I see. Designing an efficient way to store these delta-friendly
objects is the main concern. Thank you for helping me clarify this
point.

>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).
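
To check my understanding of the splitting idea: with a rolling
checksum, the split points depend only on the bytes in a small
window, so an insertion near the start of a file does not shift
every later chunk boundary. Here is my own illustrative sketch of
the general technique (the window size and mask are made-up tuning
values, not bup's or Junio's actual code):

    /* Sketch of rolling-checksum splitting with illustrative
     * constants. A boundary is declared wherever the low bits of
     * the checksum hit a fixed pattern, so boundaries are a
     * function of local content only. */
    #include <stdio.h>
    #include <stdint.h>

    #define WINDOW 64                       /* rolling window, bytes */
    #define SPLIT_MASK ((1u << 13) - 1)     /* 13 bits -> ~8KB chunks */

    static void find_chunks(const unsigned char *buf, size_t len)
    {
        unsigned char window[WINDOW] = { 0 };
        uint32_t s1 = 0, s2 = 0;
        size_t i, start = 0, pos = 0;

        for (i = 0; i < len; i++) {
            unsigned char drop = window[pos];

            window[pos] = buf[i];
            pos = (pos + 1) % WINDOW;

            /* adler-style rolling update: remove the byte leaving
             * the window, add the byte entering it */
            s1 += buf[i] - drop;
            s2 += s1 - WINDOW * drop;

            if ((s2 & SPLIT_MASK) == SPLIT_MASK) {
                printf("chunk at [%zu, %zu]\n", start, i);
                start = i + 1;
            }
        }
        if (start < len)
            printf("chunk at [%zu, %zu]\n", start, len - 1);
    }

Two versions of a VM image that differ in one region would then
share most of their chunks, and only the changed chunks need new
storage or delta bases.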
>
> Note that there are other problem areas with big files that can be
> worked on, too. For example, some people want to store 100 gigabytes in
> a repository. Because git is distributed, that means 100G in the repo
> database, and 100G in the working directory, for a total of 200G. People
> in this situation may want to be able to store part of the repository
> database in a network-accessible location, trading some of the
> convenience of being fully distributed for the space savings. So another
> project could be designing a network-based alternate object storage
> system.

From the architecture point of view, CVS is fully centralized and
Git is fully distributed. It seems that for big repositories, the
design described above now sits somewhere in the middle ^-^.
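
(For reference, git's existing alternates mechanism, i.e. the path
list in .git/objects/info/alternates, already lets a repository
borrow objects from another local object store, for example by
adding a line like "/mnt/shared/objects" to that file. As far as I
understand, such a project would generalize that idea to remote
storage.)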

>
> -Peff

Bo

