I appreciate the prompt reply. My comments are inline below.

On Fri, Mar 30, 2012 at 4:34 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
>> The sub-problems of the "delta for large file" problem.
>>
>> 1 large file
>>
>> 1.1 text file (always delta well? need to be confirmed)
>
> They often do, but text files don't tend to be large. There are some
> exceptions (e.g., genetic data is often kept in line-oriented text
> files, but is very large).
>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.

It seems that we should first provide some mechanism that can
distinguish delta-friendly objects from non-delta-friendly objects. I am
wondering whether such an algorithm already exists or still needs to be
devised.

>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?

Ah, I see. Designing an efficient solution for storing these
delta-friendly objects is the main concern. Thank you for helping me
clarify this point.

>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).

(I have appended a rough sketch of this kind of splitting at the end of
this mail.)

>
> Note that there are other problem areas with big files that can be
> worked on, too. For example, some people want to store 100 gigabytes in
> a repository. Because git is distributed, that means 100G in the repo
> database, and 100G in the working directory, for a total of 200G. People
> in this situation may want to be able to store part of the repository
> database in a network-accessible location, trading some of the
> convenience of being fully distributed for the space savings. So another
> project could be designing a network-based alternate object storage
> system.

From an architecture point of view, CVS is fully centralized and Git is
fully distributed. It seems that for big repositories, the architecture
described above sits somewhere in the middle ^-^.

>
> -Peff

Bo
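
P.S. To make the chunking idea a bit more concrete for myself, here is a
minimal sketch of content-defined splitting in the spirit of bupsplit:
keep an rsync-style rolling checksum over a small window of the file and
cut a chunk whenever the low bits of the checksum are all ones, so that
boundaries depend only on nearby bytes. The 64-byte window, the 13-bit
mask, and the checksum details are my own illustrative choices, not
bup's exact rollsum.

/*
 * Illustrative content-defined chunking, loosely modeled on the
 * rsync-style rolling checksum that bupsplit is based on.  The
 * window size, split mask, and checksum details are guesses for
 * demonstration only; bup's real rollsum differs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define WINDOW     64
#define SPLIT_MASK ((1u << 13) - 1)	/* ~8KB chunks on average */

static size_t split_chunks(const unsigned char *buf, size_t len)
{
	unsigned char window[WINDOW] = { 0 };
	uint32_t s1 = 0, s2 = 0;	/* rolling sums over the window */
	size_t i, pos = 0, start = 0, nr = 0;

	for (i = 0; i < len; i++) {
		unsigned char old = window[pos];

		/* slide the window: drop the oldest byte, add the newest */
		s1 += buf[i] - old;
		s2 += s1 - (uint32_t)WINDOW * old;
		window[pos] = buf[i];
		pos = (pos + 1) % WINDOW;

		/* chunk boundary: low 13 bits of the checksum all ones */
		if ((s2 & SPLIT_MASK) == SPLIT_MASK) {
			printf("chunk %zu: offset %zu, length %zu\n",
			       nr++, start, i + 1 - start);
			start = i + 1;
		}
	}
	if (start < len) {	/* whatever is left is the final chunk */
		printf("chunk %zu: offset %zu, length %zu\n",
		       nr++, start, len - start);
	}
	return nr;
}

int main(void)
{
	/* demo input: 1MB of pseudo-random bytes */
	static unsigned char buf[1 << 20];
	size_t i;

	srand(42);
	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (unsigned char)(rand() & 0xff);

	printf("total chunks: %zu\n", split_chunks(buf, sizeof(buf)));
	return 0;
}

The point of splitting on content rather than at fixed offsets is that
an edit near the beginning of a big file only moves the boundaries
around the edit; the chunks after it keep the same content, so they can
still be deduplicated or deltaed against the previous version of the
file.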