Re: GSoC - Some questions on the idea of


 



On 3/30/2012 3:34 PM, Jeff King wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
> > The sub-problems of "delta for large file" problem.
> >
> > 1 large file
>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.
>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?
>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).
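
(To make the chunking idea concrete for myself, here is a toy sketch of content-defined splitting in the spirit of bupsplit. It is not bup's actual code; the window size and split mask are made-up values, and real implementations use a more carefully tuned rolling checksum.)

    # Toy sketch of content-defined chunking (illustrative values only).
    WINDOW = 64          # bytes in the rolling window
    SPLIT_MASK = 0x1fff  # split when the low 13 bits are set -> ~8 KiB average chunks

    def chunks(data):
        """Yield (offset, length) pairs for content-defined chunks of `data`."""
        start = 0
        s1 = s2 = 0
        window = bytearray(WINDOW)
        for i, byte in enumerate(data):
            old = window[i % WINDOW]
            window[i % WINDOW] = byte
            # rsync/adler-style rolling sums over the last WINDOW bytes
            s1 += byte - old
            s2 += s1 - WINDOW * old
            digest = (s1 & 0xffff) | ((s2 & 0xffff) << 16)
            if (digest & SPLIT_MASK) == SPLIT_MASK:
                yield (start, i + 1 - start)
                start = i + 1
        if start < len(data):
            yield (start, len(data) - start)

The point, as I understand it, is that chunk boundaries depend only on nearby bytes, so an insertion near the start of a huge file does not shift every later chunk; unchanged chunks can still be recognized and deduplicated or deltified.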

(I'm no expert on "big files" in git or elsewhere, but this thread is immensely interesting to me as a git user who, in the very near future, wants to track all sorts of binary files and possibly large text files, i.e. all the components tied to a server build and its upgrades beyond the linux-distro/rpms, and perhaps including those as well.)

Let's take an even bigger step back for a moment. Who decides whether a file is a "big-file": git or the user? And how is that decision made?

Who decides bigness:
Bigness seems to be relative to system resources. Does the user crunch the numbers to decide whether a file is a big-file, or does git? If the threshold is relative, should git query the system and make the determination itself? Either way, once the system resources are upgraded and formerly "big" files are no longer considered big, how is the existing history refactored to behave in a non-big-file way? Conversely, if the resources are redistributed so that formerly non-big files are now relatively big (e.g. moving from a powerful central login server to laptops), how is the history refactored to accommodate the newly relative bigness?
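
(For what it's worth, as far as I can tell git already draws one such line itself: core.bigFileThreshold, which I believe defaults to 512 MiB. Anything above it is stored deflated but never considered for delta compression, and it is a plain per-repository config knob rather than something derived from system resources:

    git config core.bigFileThreshold        # show the current cutoff, if one is set
    git config core.bigFileThreshold 100m   # lower it, e.g. for a memory-starved machine

So today the answer seems to be "the user, via a fixed byte count", which is exactly what makes the relative-bigness question interesting.)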

How bigness is decided:
There seem to be two basic types of big-files: big-worktree-files and big-history-files. A big-worktree-file that is delta-friendly is not a big-history-file, while even a non-big-worktree-file that is delta-unfriendly becomes a big-history problem. If you are working alone on an old computer, you are probably more concerned about big-worktree-files (memory). If you are working in a large group making lots of changes to the same files on a powerful server, you are probably more concerned about big-history size (disk space). Of course, everyone is concerned about big-worktree-files that are delta-unfriendly.
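
(A rough way to see which of the two problems a particular repository actually has, with a made-up size threshold:

    # big-worktree-files: anything huge sitting in the checkout itself
    find . -path ./.git -prune -o -type f -size +100M -print

    # big-history-files: how much the object store has accumulated
    git count-objects -v    # the "size-pack" line is the packed history, in KiB

The first is what hurts memory when adding or diffing; the second is what hurts disk space and clone time as history grows.)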

At what point is a delta-friendly file considered a "big-file"? I assume that may depend on the degree of delta-friendliness. I imagine that a text file and a vm-image differ in delta-friendliness by several degrees.

At what point is a delta-unfriendly file considered a "big-file"? I assume that may depend on the degree of delta-unfriendliness. I imagine a compiled program and a compressed container differ in delta-unfriendliness by several degrees.
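
(One crude way to put a number on delta-friendliness between two revisions of a file, just to build intuition; the file names are made up, and gzip's small window means this badly underestimates sharing for really large files:

    gzip -c vm-v1.img | wc -c                      # compressed alone
    gzip -c vm-v2.img | wc -c                      # compressed alone
    cat vm-v1.img vm-v2.img | gzip -c | wc -c      # compressed together

If the "together" number is much smaller than the sum of the first two, the revisions share a lot of content and should delta well; if it is roughly the sum, they will not.)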

My understanding is that git never delta-compresses binary files. If that is the case, even a small-worktree binary file becomes a big-history-file over time.
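
(I may be wrong about that, and it is easy to check on a concrete repository; the path below is just an example. Repack and then ask verify-pack whether a particular blob ended up stored as a delta:

    git repack -adf
    blob=$(git rev-parse HEAD:images/logo.png)     # example path to a binary file
    git verify-pack -v .git/objects/pack/pack-*.idx | grep $blob

If the matching line ends with a chain depth and a base SHA-1, the blob was deltified; if it only shows type, sizes and offset, it was stored whole.)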

v/r,
neal

