Re: [PATCH v2 14/21] read_packed_refs(): ensure that references are ordered when read

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Thu, 21 Sep 2017 10:27:14 +0200

On 09/20/2017 08:50 PM, Jeff King wrote:
> On Tue, Sep 19, 2017 at 08:22:22AM +0200, Michael Haggerty wrote:
>> If `packed-refs` was already sorted, then (if the system allows it) we
>> can use the mmapped file contents directly. But if the system doesn't
>> allow a file that is currently mmapped to be replaced using
>> `rename()`, then it would be bad for us to keep the file mmapped for
>> any longer than necessary. So, on such systems, always make a copy of
>> the file contents, either as part of the sorting process, or
>> afterwards.
> 
> So this sort-of answers my question from the previous commit (why we
> care about the distinction between NONE and TEMPORARY), since we now
> start treating them differently.
> 
> But I'm a bit confused why there's any advantage in the TEMPORARY case
> to doing the mmap-and-copy versus just treating it like NONE and reading
> it directly.
> 
> Hmm, I guess it comes down to the double-allocation thing again? Let me
> see if I have this right:
> 
>   1. For NO_MMAP, we'd read the buffer once. If it's sorted, good. If
>      not, then we temporarily hold it in memory twice while we copy it
>      into the new sorted buffer.
> 
>   2. For TEMPORARY, we mmap once. If it's sorted, then we make a single
>      copy. If it's not sorted, then we do the copy+sort as a single
>      step.
> 
>   3. For MMAP_OK, if it's sorted, we're done. Otherwise, we do the
>      single copy.
> 
> So this is really there to help the TEMPORARY case reading an old
> unsorted file avoid needing to use double-the-ram during the copy?
> 
> That seems like a good reason (and it does not seem to add too much
> complexity).

Yes, that's correct. A deeper question is whether it's worth this extra
effort to optimize the conversion from "unsorted" to "known-sorted",
which it seems should only happen once per repository. But

* Many more Git commands read the `packed-refs` file than write it.
  So an "unsorted" file might have to be read multiple times before
  the first time it is rewritten with the "sorted" trait.

* Users might have multiple Git clients writing their `packed-refs`
  file (e.g., the git CLI plus JGit in Eclipse), and they might not
  upgrade both clients at the same time to versions that write the
  "sorted" trait. So a single repository might go back and forth
  between "unsorted" and "known-sorted" multiple times in its
  life.

* Theoretically, somebody might have a `packed-refs` file that is so
  big that it takes up more than half of memory. I think there are
  scenarios where such a file could be converted to "sorted" form
  in `MMAP_TEMPORARY` mode but would fail in `MMAP_NONE` mode.

On the downside, in my benchmarking on Linux, there were hints that the
`MMAP_TEMPORARY` branch might be a *tiny* bit slower than the `MMAP_OK`
branch when operating on a known-sorted `packed-refs` file. But the
speed difference was barely detectable (and might be illusory). And
anyway, the speed difference on Linux is moot, since on Linux `MMAP_OK`
mode will usually be used. It *would* be interesting to know if there is
a significant speed difference on Windows. Dscho?

Michael