On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> 1. to introduce a configuration parameter that will define whether to use
> mmap() to hash files or not. It is a trivial change, but the real question
> is what default value for this option (should we do some heuristic based
> on filesize vs available memory?)

I don't like that kind of heuristic. They're almost always wrong, and any
issue is damn hard to reproduce. I tend to believe that mmap() works better
by letting the OS page memory in and out as needed, whereas reading data
into allocated memory is only going to force the system into swap.

> 2. to stream files in chunks. It is better because it is faster, especially on
> large files, as you calculate SHA-1 and zip data while they are in CPU
> cache. However, it may be more difficult to implement, because we have
> filters that should be applied to files that are put into the repository.

So? "More difficult" when it is the right thing to do is no excuse not to
do it and settle for a half solution. Merely replacing mmap() with read()
has drawbacks while the advantages aren't that many. Gaining a few percent
in speed while making the process less robust when memory is tight isn't
such a great compromise to me. BUT if you were to replace mmap() with
read() and make the process chunked, then you do improve both speed _and_
memory usage.

As to huge files: we have that core.bigFileThreshold variable now, and
anything that crosses it should be considered "stream in / stream out"
without further considerations. That means no diff, no rename similarity
estimates, no delta, no filter, no blame, no fancies. If you have source
code files that big then you do have a bigger problem already anyway.
Typical huge files are rarely manipulated, and when they are, it is pretty
unlikely that they will be compared with other versions using diff, and
that also means that you have the storage capacity and network bandwidth
to deal with them. Hence repository tightness is not your top concern in
that case, but repack/checkout speed most likely is.

So big files should be streamed into a pack of their own at "git add"
time. Then repack will simply "reuse pack data" without delta compression
attempts, meaning that they will be streamed into a single huge pack with
no issue (this particular case is already supported in the code).

> 3. to improve Git to support huge files on computers with low memory.

That comes for free with #2.


Nicolas
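
To make the chunked approach in #2 concrete, here is a minimal standalone
sketch of a read-hash-deflate loop. It uses OpenSSL's SHA-1 and zlib
directly rather than git's internal wrappers, leaves out the clean/smudge
filters mentioned above, and skips short-read/EINTR and write-error
handling; the buffer size and function names are placeholders, not git's
actual API. The point is only that memory use stays bounded by the buffer
size while each chunk is hashed and deflated while still warm in the cache.

/*
 * Sketch: hash and deflate a file in fixed-size chunks instead of
 * mmap()ing it whole.  Standalone example using OpenSSL SHA-1 and zlib,
 * not git's internal wrappers.  (git would also feed the "blob <len>"
 * object header to the hash before the data; omitted here.)
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <openssl/sha.h>
#include <zlib.h>

#define CHUNK (1024 * 1024)	/* bounded window; the whole file is never in memory */

static unsigned char in[CHUNK], out[CHUNK];

static int hash_and_deflate_fd(int fd, FILE *dst, unsigned char sha1[20])
{
	SHA_CTX c;
	z_stream z;
	ssize_t n;
	int ret;

	SHA1_Init(&c);
	memset(&z, 0, sizeof(z));
	if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
		return -1;

	/* Each chunk is hashed and deflated right away, while still in cache. */
	while ((n = read(fd, in, sizeof(in))) > 0) {
		SHA1_Update(&c, in, n);
		z.next_in = in;
		z.avail_in = n;
		do {
			z.next_out = out;
			z.avail_out = sizeof(out);
			deflate(&z, Z_NO_FLUSH);
			fwrite(out, 1, sizeof(out) - z.avail_out, dst);
		} while (z.avail_out == 0);
	}

	/* Flush whatever deflate still holds buffered. */
	do {
		z.next_out = out;
		z.avail_out = sizeof(out);
		ret = deflate(&z, Z_FINISH);
		fwrite(out, 1, sizeof(out) - z.avail_out, dst);
	} while (ret != Z_STREAM_END);

	deflateEnd(&z);
	SHA1_Final(sha1, &c);
	return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
	unsigned char sha1[20];
	int i, fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (hash_and_deflate_fd(fd, stdout, sha1))
		return 1;
	for (i = 0; i < 20; i++)
		fprintf(stderr, "%02x", sha1[i]);
	fprintf(stderr, "\n");
	return 0;
}

With a loop of this shape, the core.bigFileThreshold path described above
could in principle feed the deflated stream straight into a pack of its
own at "git add" time, without the whole blob ever being held in memory.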