On Sat, Mar 31, 2012 at 4:28 PM, Neal Kreitzinger <nkreitzinger@xxxxxxxxx> wrote:
> On 3/30/2012 3:34 PM, Jeff King wrote:
>>
>> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>>
>>> The sub-problems of "delta for large file" problem.
>>>
>>> 1 large file
>>>
>> But let's take a step back for a moment. Forget about whether a file is
>> binary or not. Imagine you want to store a very large file in git.
>>
>> What are the operations that will perform badly? How can we make them
>> perform acceptably, and what tradeoffs must we make? E.g., the way the
>> diff code is written, it would be very difficult to run "git diff" on a
>> 2 gigabyte file. But is that actually a problem? Answering that means
>> talking about the characteristics of 2 gigabyte files, and what we
>> expect to see, and to what degree our tradeoffs will impact them.
>>
>> Here's a more concrete example. At first, even storing a 2 gigabyte file
>> with "git add" was painful, because we would load the whole thing in
>> memory. Repacking the repository was painful, because we had to rewrite
>> the whole 2G file into a packfile. Nowadays, we stream large files
>> directly into their own packfiles, and we have to pay the I/O only once
>> (and the memory cost never). As a tradeoff, we no longer get delta
>> compression of large objects. That's OK for some large objects, like
>> movie files (which don't tend to delta well, anyway). But it's not for
>> other objects, like virtual machine images, which do tend to delta well.
>>
>> So can we devise a solution which efficiently stores these
>> delta-friendly objects, without losing the performance improvements we
>> got with the stream-directly-to-packfile approach?
>>
>> One possible solution is breaking large files into smaller chunks using
>> something like the bupsplit algorithm (and I won't go into the details
>> here, as links to bup have already been mentioned elsewhere, and Junio's
>> patches make a start at this sort of splitting).
>>
> (I'm no expert on "big-files" in git or elsewhere, but this thread is
> immensely interesting to me as a git user who wants to track all sorts of
> binary files and possibly large text files in the very near future, ie. all
> components tied to a server build and upgrades beyond the linux-distro/rpms
> and perhaps including them also.)
>
> Let's take an even bigger step back for a moment. Who determines if a file
> shall be a big-file or not? Git or the user? How is it determined if a
> file shall be a "big-file" or not?
>
> Who decides bigness:
> Bigness seems to be relative to system resources. Does the user crunch the
> numbers to determine if a file is big-file, or does git? If the numbers are
> relative then should git query the system and make the determination?
> Either way, once the system-resources are upgraded and formerly "big-files"
> are no longer considered "big" how is the previous history refactored to
> behave "non-big-file-like"? Conversely, if the system-resources are
> re-distributed so that formerly non-big files are now relatively big (ie,
> moved from powerful central server login to laptops), how is the history
> refactored to accommodate the newly-relative-bigness?
>

Common sense says that a file of tens of MBs should not be considered a big
file, while a file of tens of GBs definitely should. I think one simple,
workable solution is to let the user set the big-file threshold.
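
For what it's worth, git already has a user-settable knob along these lines:
core.bigFileThreshold (512 MiB by default), above which blobs skip delta
compression. Just to make the decision itself concrete, here is a toy,
stand-alone sketch of how a user-set threshold could gate the two storage
paths Peff describes above. The helper names (stream_blob_to_pack,
add_blob_normally) are placeholders I made up, not git's real internal API:

  #include <stdio.h>
  #include <sys/stat.h>

  /* Placeholder threshold; imagine it was read from the user's config. */
  static const off_t big_file_threshold = 512L * 1024 * 1024;

  /* Stubs standing in for the two storage paths discussed above. */
  static int stream_blob_to_pack(const char *path)
  {
          printf("%s: stream straight into its own packfile, no delta\n", path);
          return 0;
  }

  static int add_blob_normally(const char *path)
  {
          printf("%s: load in core, eligible for delta compression\n", path);
          return 0;
  }

  static int add_file(const char *path)
  {
          struct stat st;

          if (stat(path, &st) < 0)
                  return -1;
          return st.st_size >= big_file_threshold ?
                  stream_blob_to_pack(path) : add_blob_normally(path);
  }

  int main(int argc, char **argv)
  {
          int i;

          for (i = 1; i < argc; i++)
                  add_file(argv[i]);
          return 0;
  }

Whoever picks the number, the decision itself stays this cheap; the hard
part is the history-refactoring question Neal raises once the threshold
changes.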
A more complicated but smarter solution is to let git auto-configure the
threshold by evaluating the computing resources of the platform it is
running on (a physical machine or just a VM). To cope with moving a
repository between platforms with different computing power, the repo
would also have to keep track of the big-file threshold under which each
specific file was handled. (I have also appended a rough sketch of the
bupsplit-style chunking Peff mentioned, at the end of this mail.)

> How bigness is decided:
> There seems to be two basic types of big-files: big-worktree-files, and
> big-history-files. A big-worktree-file that is delta-friendly is not a
> big-history-file. A non-big-worktree-file that is delta-unfriendly is a
> big-file-history problem. If you are working alone on an old computer you
> are probably more concerned about big-worktree-files (memory). If you are
> working in a large group making lots of changes to the same files on a
> powerful server then you are probably more concerned about
> big-history-file-size (diskspace). Of course, all are concerned about
> big-worktree-files that are delta-unfriendly.
>
> At what point is a delta-friendly file considered a "big-file"? I assume
> that may depend on the degree of delta-friendliness. I imagine that a text
> file and vm-image differ in delta-friendliness by several degrees.
>
> At what point(s) is a delta-unfriendly file considered a "big-file"? I
> assume that may depend on the degree(s) of delta-unfriendliness. I imagine
> a compiled program and compressed-container differ in delta-unfriendliness
> by several degrees.
>
> My understanding is that git does not ever delta-compress binary files.
> That would mean even a small-worktree-binary-file becomes a
> big-history-file over time.
>
> v/r,
> neal
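
Coming back to the chunking idea Peff mentions above, here is the rough
sketch I promised. This is not bup's actual rollsum code; it is a
simplified rsync-style rolling checksum with made-up constants, only to
illustrate why content-defined boundaries help: an insertion near the
front of a huge file only disturbs the chunk(s) around the edit, so most
chunks stay identical to what is already stored.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define WINDOW     64                       /* bytes of context in the rolling sum */
  #define CHUNK_BITS 13                       /* aim for roughly 8KB average chunks */
  #define CHUNK_MASK ((1u << CHUNK_BITS) - 1)

  struct rollsum {
          uint32_t s1, s2;
          uint8_t win[WINDOW];
          size_t pos;
  };

  static void rollsum_init(struct rollsum *r)
  {
          memset(r, 0, sizeof(*r));
          r->s1 = WINDOW * 31;
          r->s2 = WINDOW * (WINDOW - 1) * 31;
  }

  /* Slide the window one byte: drop the oldest byte, add the new one. */
  static void rollsum_roll(struct rollsum *r, uint8_t in)
  {
          uint8_t out = r->win[r->pos];

          r->win[r->pos] = in;
          r->pos = (r->pos + 1) % WINDOW;
          r->s1 += in - out;
          r->s2 += r->s1 - WINDOW * (out + 31);
  }

  static uint32_t rollsum_digest(const struct rollsum *r)
  {
          return (r->s1 << 16) | (r->s2 & 0xffff);
  }

  /* Print content-defined chunk boundaries for an in-memory buffer. */
  static void split_into_chunks(const uint8_t *buf, size_t len)
  {
          struct rollsum r;
          size_t i, start = 0;

          rollsum_init(&r);
          for (i = 0; i < len; i++) {
                  rollsum_roll(&r, buf[i]);
                  if ((rollsum_digest(&r) & CHUNK_MASK) == CHUNK_MASK) {
                          printf("chunk %zu..%zu\n", start, i);
                          start = i + 1;
                  }
          }
          if (start < len)
                  printf("chunk %zu..%zu\n", start, len - 1);
  }

  int main(void)
  {
          static uint8_t buf[1 << 20];
          size_t i;

          for (i = 0; i < sizeof(buf); i++)    /* any deterministic "content" */
                  buf[i] = (uint8_t)(i * 2654435761u >> 24);
          split_into_chunks(buf, sizeof(buf));
          return 0;
  }

With boundaries chosen this way, the chunk lists of two nearby versions of
a VM image should mostly coincide, which is what makes per-chunk storage
(and deltas between matching chunks) attractive without giving up the
stream-directly-to-packfile path.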