Re: [PATCH v2 14/21] read_packed_refs(): ensure that references are ordered when read

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Mon, 25 Sep 2017 17:44:55 +0200 (CEST)

Hi Michael,

On Thu, 21 Sep 2017, Michael Haggerty wrote:

> On 09/20/2017 08:50 PM, Jeff King wrote:
> > On Tue, Sep 19, 2017 at 08:22:22AM +0200, Michael Haggerty wrote:
> >> If `packed-refs` was already sorted, then (if the system allows it)
> >> we can use the mmapped file contents directly. But if the system
> >> doesn't allow a file that is currently mmapped to be replaced using
> >> `rename()`, then it would be bad for us to keep the file mmapped for
> >> any longer than necessary. So, on such systems, always make a copy of
> >> the file contents, either as part of the sorting process, or
> >> afterwards.
> > 
> > So this sort-of answers my question from the previous commit (why we
> > care about the distinction between NONE and TEMPORARY), since we now
> > start treating them differently.
> > 
> > But I'm a bit confused why there's any advantage in the TEMPORARY case
> > to doing the mmap-and-copy versus just treating it like NONE and reading
> > it directly.
> > 
> > Hmm, I guess it comes down to the double-allocation thing again? Let me
> > see if I have this right:
> > 
> >   1. For NO_MMAP, we'd read the buffer once. If it's sorted, good. If
> >      not, then we temporarily hold it in memory twice while we copy it
> >      into the new sorted buffer.
> > 
> >   2. For TEMPORARY, we mmap once. If it's sorted, then we make a single
> >      copy. If it's not sorted, then we do the copy+sort as a single
> >      step.
> > 
> >   3. For MMAP_OK, if it's sorted, we're done. Otherwise, we do the
> >      single copy.
> > 
> > So this is really there to help the TEMPORARY case reading an old
> > unsorted file avoid needing to use double-the-ram during the copy?
> > 
> > That seems like a good reason (and it does not seem to add too much
> > complexity).
> 
> Yes, that's correct. A deeper question is whether it's worth this extra
> effort to optimize the conversion from "unsorted" to "known-sorted",
> which it seems should only happen once per repository. But
> 
> * Many more Git commands read the `packed-refs` file than write it.
>   So an "unsorted" file might have to be read multiple times before
>   the first time it is rewritten with the "sorted" trait.
> 
> * Users might have multiple Git clients writing their `packed-refs`
>   file (e.g., the git CLI plus JGit in Eclipse), and they might not
>   upgrade both clients at the same time to versions that write the
>   "sorted" trait. So a single repository might go back and forth
>   between "unsorted" and "known-sorted" multiple times in its
>   life.
> 
> * Theoretically, somebody might have a `packed-refs` file that is so
>   big that it takes up more than half of memory. I think there are
>   scenarios where such a file could be converted to "sorted" form
>   in `MMAP_TEMPORARY` mode but would fail in `MMAP_NONE` mode.
> 
> On the downside, in my benchmarking on Linux, there were hints that the
> `MMAP_TEMPORARY` branch might be a *tiny* bit slower than the `MMAP_OK`
> branch when operating on a known-sorted `packed-refs` file. But the
> speed difference was barely detectable (and might be illusory). And
> anyway, the speed difference on Linux is moot, since on Linux `MMAP_OK`
> mode will usually be used. It *would* be interesting to know if there is
> a significant speed difference on Windows. Dscho?

Sadly, I lack the time to test this thoroughly, but my experience is that
those things are dominated by I/O on Windows, i.e. that the malloc() and
copy won't matter all that much.

In any case, the same argument you make for Linux holds, in reverse, for
Windows: we cannot use MMAP_OK on Windows, so the point is moot ;-)

Ciao,
Dscho