On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> 1. to introduce a configuration parameter that will define whether to use
> mmap() to hash files or not. It is a trivial change, but the real question
> is what default value for this option (should we do some heuristic based
> on filesize vs available memory?)

I don't like that kind of heuristic. They're almost always wrong, and any
issue is damn hard to reproduce. I tend to believe that mmap() works better
by letting the OS page memory in and out as needed, whereas reading data
into allocated memory is only going to force the system into swap.

> 2. to stream files in chunks. It is better because it is faster, especially on
> large files, as you calculate SHA-1 and zip data while they are in CPU
> cache. However, it may be more difficult to implement, because we have
> filters that should be applied to files that are put into the repository.

So? "More difficult" when it is the right thing to do is no excuse not to
do it and settle for a half solution. Merely replacing mmap() with read()
has drawbacks while the advantages aren't that many. Gaining a few percent
in speed while making the process less robust when memory is tight isn't
such a great compromise to me. BUT if you were to replace mmap() with
read() and make the process chunked, then you do improve both speed _and_
memory usage.

As to huge files: we have that core.bigFileThreshold variable now, and
anything that crosses it should be considered "stream in / stream out"
without further considerations. That means no diff, no rename similarity
estimates, no delta, no filter, no blame, no fancies. If you have source
code files that big then you do have a bigger problem already anyway.
Typical huge files are rarely manipulated, and when they are, it is pretty
unlikely that they will be compared with other versions using diff, and
that also means that you have the storage capacity and network bandwidth
to deal with them. Hence repository tightness is not your top concern in
that case, but repack/checkout speed most likely is.

So big files should be streamed into a pack of their own at "git add"
time. Then repack will simply "reuse pack data" without delta compression
attempts, meaning that they will be streamed into a single huge pack with
no issue (this particular case is already supported in the code).

> 3. to improve Git to support huge files on computers with low memory.

That comes for free with #2.


Nicolas
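
To make the chunked approach in #2 concrete, here is a minimal standalone
sketch of a read-hash-deflate loop. It uses OpenSSL's SHA-1 and zlib
directly rather than git's internal wrappers, leaves out the clean/smudge
filters mentioned above, and skips short-read/EINTR and write-error
handling; the buffer size and function names are placeholders, not git's
actual API. The point is only that memory use stays bounded by the buffer
size while each chunk is hashed and deflated while still warm in the cache.

/*
 * Sketch: hash and deflate a file in fixed-size chunks instead of
 * mmap()ing it whole.  Standalone example using OpenSSL SHA-1 and zlib,
 * not git's internal wrappers.  (git would also feed the "blob <len>"
 * object header to the hash before the data; omitted here.)
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <openssl/sha.h>
#include <zlib.h>

#define CHUNK (1024 * 1024)	/* bounded window; the whole file is never in memory */

static unsigned char in[CHUNK], out[CHUNK];

static int hash_and_deflate_fd(int fd, FILE *dst, unsigned char sha1[20])
{
	SHA_CTX c;
	z_stream z;
	ssize_t n;
	int ret;

	SHA1_Init(&c);
	memset(&z, 0, sizeof(z));
	if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
		return -1;

	/* Each chunk is hashed and deflated right away, while still in cache. */
	while ((n = read(fd, in, sizeof(in))) > 0) {
		SHA1_Update(&c, in, n);
		z.next_in = in;
		z.avail_in = n;
		do {
			z.next_out = out;
			z.avail_out = sizeof(out);
			deflate(&z, Z_NO_FLUSH);
			fwrite(out, 1, sizeof(out) - z.avail_out, dst);
		} while (z.avail_out == 0);
	}

	/* Flush whatever deflate still holds buffered. */
	do {
		z.next_out = out;
		z.avail_out = sizeof(out);
		ret = deflate(&z, Z_FINISH);
		fwrite(out, 1, sizeof(out) - z.avail_out, dst);
	} while (ret != Z_STREAM_END);

	deflateEnd(&z);
	SHA1_Final(sha1, &c);
	return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
	unsigned char sha1[20];
	int i, fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (hash_and_deflate_fd(fd, stdout, sha1))
		return 1;
	for (i = 0; i < 20; i++)
		fprintf(stderr, "%02x", sha1[i]);
	fprintf(stderr, "\n");
	return 0;
}

With a loop of this shape, the core.bigFileThreshold path described above
could in principle feed the deflated stream straight into a pack of its
own at "git add" time, without the whole blob ever being held in memory.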