Re: [PATCH] don't use mmap() to hash files

On Mon, Feb 15, 2010 at 8:05 AM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> On Sun, 14 Feb 2010, Dmitry Potapov wrote:
>
>> 1. to introduce a configuration parameter that defines whether or not
>> to use mmap() to hash files. It is a trivial change, but the real
>> question is what the default value of this option should be (should we
>> use some heuristic based on file size vs. available memory?)
>
> I don't like that kind of heuristic.  They're almost always wrong, and
> any issue is damn hard to reproduce.  I tend to believe that mmap()
> works better by letting the OS page memory in and out as needed, while
> reading data into allocated memory is only going to force the system
> into swap.

You are probably right; a heuristic is a bad idea. Still, if we keep
mmap() here, we may want to add an option to disable it during hash
calculation. That said, I don't like keeping mmap() there at all if we
go for #2. See below...
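
For illustration only, a minimal sketch of such a knob inside git's own
tree -- the option name "core.mmaphashfiles" and the callback are
invented here, not existing code, but follow the usual git_config
callback shape:

	/* Sketch only; "core.mmaphashfiles" is a hypothetical option. */
	static int use_mmap_to_hash = 1;  /* current behaviour by default */

	static int hash_config(const char *var, const char *value, void *cb)
	{
		if (!strcmp(var, "core.mmaphashfiles")) {
			use_mmap_to_hash = git_config_bool(var, value);
			return 0;
		}
		return git_default_config(var, value, cb);
	}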

>
>> 2. to stream files in chunks. It is better because it is faster,
>> especially on large files, as you calculate the SHA-1 and zip the data
>> while they are still in the CPU cache. However, it may be more
>> difficult to implement, because we have filters that should be applied
>> to files that are put into the repository.
>
> So?  "More difficult" when it is the right thing to do is no excuse not
> to do it and satisfy ourselves with a half solution.  Merely replacing
> mmap() with read() has drawbacks while the advantages aren't that many.
> Gaining a few percent in speed while making the process less robust
> when memory is tight isn't such a great compromise to me.  BUT if you
> were to replace mmap() with read() and make the process chunked then
> you do improve both speed _and_ memory usage.

I have not had time to look closely at this, but there is one problem
that I noticed -- the header of any git object contains the blob length.
We know this length in advance (without reading all the data) only for
regular files, and only if no filter has to be applied to them. In all
other cases, it seems we cannot do much better than we do now, assuming
that we do not want to change the storage format...
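
To make the dependency concrete, here is a minimal sketch of chunked
hashing with plain read() and OpenSSL's SHA-1 (git's own SHA-1 wrappers
differ in detail; the chunk size is arbitrary). The point is that the
size goes into the header, which is hashed before the first byte of
payload:

	#include <openssl/sha.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	#define CHUNK_SIZE (8 * 1024 * 1024)

	/*
	 * Hash "blob <size>\0" followed by the file contents,
	 * CHUNK_SIZE bytes at a time.  The final size must be
	 * known before the first chunk is fed to the hash.
	 */
	static int hash_blob_chunked(int fd, off_t size, unsigned char sha1[20])
	{
		static char buf[CHUNK_SIZE];
		char hdr[32];
		int hdrlen;
		SHA_CTX ctx;

		hdrlen = snprintf(hdr, sizeof(hdr), "blob %lu",
				  (unsigned long)size) + 1;  /* include the NUL */

		SHA1_Init(&ctx);
		SHA1_Update(&ctx, hdr, hdrlen);

		while (size > 0) {
			ssize_t n = read(fd, buf,
					 size < CHUNK_SIZE ? (size_t)size : CHUNK_SIZE);
			if (n <= 0)
				return -1;  /* read error or unexpected EOF */
			SHA1_Update(&ctx, buf, n);
			size -= n;
		}
		SHA1_Final(sha1, &ctx);
		return 0;
	}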

If so, the question remains what to do about regular files that have
some filter. Currently, we use mmap() for the original data but store
the processed data in memory anyway. The question is whether we want to
keep this use of mmap() here. Considering that it is a potential source
of repository corruption, and that these filters should not be used on
big files anyway because they take a lot of memory, I think we should
get rid of mmap() in file hashing completely once we can process
regular files without filters in chunks.
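
Roughly, the split I have in mind looks like this (a sketch only;
path_has_filters() and hash_filtered_in_memory() are hypothetical
stand-ins for the real filter machinery, and hash_blob_chunked() is the
helper sketched above):

	static int hash_file(int fd, const char *path, off_t size,
			     unsigned char sha1[20])
	{
		if (!path_has_filters(path))  /* hypothetical check */
			/* No filter: final size == file size, so stream it. */
			return hash_blob_chunked(fd, size, sha1);

		/*
		 * A filter can change the length, and the length goes
		 * into the object header, so run the filter into an
		 * in-memory buffer first and hash that.  No mmap()
		 * involved: the source can be read in chunks too.
		 */
		return hash_filtered_in_memory(fd, path, sha1);  /* hypothetical */
	}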


Dmitry
