On Tue, Aug 15, 2017 at 02:53:07PM +0200, Martin Ågren wrote:

> Using SANITIZE=thread made t5400-send-pack.sh hit the potential race
> below.
>
> This is set_try_to_free_routine in wrapper.c. The race relates to the
> reading of the "old" value. The caller doesn't care about the "old"
> value, so this should be harmless right now. But it seems that using
> this mechanism from multiple threads and restoring the earlier value
> will probably not work out every time. (Not necessarily because of the
> race in set_try_to_free_routine, but, e.g., because callers might not
> restore the function pointer in the reverse order of how they
> originally set it.)
>
> Properly "fixing" this for thread-safety would probably require some
> redesigning, which at this time might not be warranted. I'm just
> posting this for completeness.
>
> Martin
>
> WARNING: ThreadSanitizer: data race (pid=21382)
>   Read of size 8 at 0x000000979970 by thread T1:
>     #0 set_try_to_free_routine wrapper.c:35 (git+0x0000006cde1c)
>     #1 prepare_trace_line trace.c:105 (git+0x0000006a3bf0)
>     #2 trace_strbuf_fl trace.c:185 (git+0x0000006a3bf0)
>     #3 packet_trace pkt-line.c:80 (git+0x0000005f9f43)
>     #4 packet_read pkt-line.c:309 (git+0x0000005fbe10)
>     #5 recv_sideband sideband.c:37 (git+0x000000684c5e)
>     #6 sideband_demux send-pack.c:216 (git+0x00000065a38c)
>     #7 run_thread run-command.c:933 (git+0x000000655a93)
>     #8 <null> <null> (libtsan.so.0+0x0000000230d9)

I was curious why the trace code would care about the free routine in
the first place. Digging through the mailing list, I didn't find much
discussion. But I think the problem is basically that the trace
infrastructure wants to be thread-safe, while the default
free-pack-memory callback isn't.

It's ironic that we fix the thread-unsafety of the free-pack-memory
function by using the also-thread-unsafe set_try_to_free_routine.

Further irony: the trace routines aren't thread-safe in the first
place, as they do lazy initialization of key->fd using an "initialized"
field. In practice that probably just means double-writing key->fd and
leaking a descriptor. But there are no synchronizing operations there,
so it's entirely possible a compiler could reorder the assignments to
key->fd and key->initialized, and a simultaneous reader could then read
a garbage key->fd value. We also call getenv(), which isn't thread-safe
with respect to other calls to getenv() or setenv().

I can think of a few possible directions:

  1. Make set_try_to_free_routine() skip the write if it would be a
     noop. This is racy if threads are actually changing the value, but
     in practice they aren't (the first trace of any kind will set it
     to NULL, and it will remain there). There's a sketch below.

  2. Make the free-pack-memory routine thread-safe by taking a lock. It
     should hardly ever be called, so performance wouldn't matter. The
     big question is: _which_ lock. pack-objects, which uses threads
     already, has a version which does this (also sketched below). But
     it knows to take the big program-wide "I'm accessing unsafe parts
     of Git" lock that the rest of the program uses during its
     multi-threaded parts.

     There's no notion in the rest of Git of "now we're going into a
     multi-threaded part, so most calls will need to take a big global
     lock before doing anything interesting". For parts of Git that are
     explicitly multi-threaded (like the pack-objects delta search, or
     index-pack's delta resolution) that's not so bad. But the example
     above is just using a sideband demuxer. It would be unfortunate if
     the entire rest of send-pack had to start caring about taking that
     lock.
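For (1), I mean something like the following. This is an untested
sketch, paraphrasing wrapper.c from memory, so treat the helper names
(try_to_free_pack_memory, try_to_free_nothing) as illustrative:

  typedef void (*try_to_free_t)(size_t);

  static void try_to_free_pack_memory(size_t); /* default callback */
  static void try_to_free_nothing(size_t);     /* no-op fallback */

  static try_to_free_t try_to_free_routine = try_to_free_pack_memory;

  try_to_free_t set_try_to_free_routine(try_to_free_t routine)
  {
          try_to_free_t old = try_to_free_routine;

          if (!routine)
                  routine = try_to_free_nothing;
          /*
           * The read of "old" above is still unsynchronized, but in
           * practice every caller stores the same value, so skipping
           * the redundant store avoids the write side of the race.
           */
          if (routine != old)
                  try_to_free_routine = routine;
          return old;
  }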
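For (2), the pack-objects version is roughly this (again from memory,
so the details may be off):

  static void try_to_free_from_threads(size_t size)
  {
          read_lock();    /* pack-objects' program-wide mutex */
          release_pack_memory(size);
          read_unlock();
  }

Note that a private mutex inside the callback probably wouldn't be
enough, since release_pack_memory() pokes at the same pack-window
structures that other threads may be using without any lock; that's
what makes "which lock" the hard part.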
  3. Is the free-pack-memory thing actually accomplishing much these
     days? It comes from 97bfeb34df (Release pack windows before
     reporting out of memory., 2006-12-24), and the primary issue is
     not actual allocated memory, but mmap'd packs clogging up the
     address space so that malloc can't find a suitable block.

     On 64-bit systems this is likely doing nothing. We have tons of
     address space. But even on 32-bit systems, the default
     core.packedGitLimit is only 256MiB (which was set around the same
     time). You can certainly come up with a corner case where freeing
     up that address space could matter. But I'd be surprised if this
     has actually helped much in practice over the years.

     And if you have a repo which is running so close to the address
     space limits of your system, the right answer is probably: upgrade
     to a 64-bit system. Even if the try-to-free thing helped in one
     run, it's likely that similar runs are not going to be so lucky,
     and even with it you're going to see sporadic out-of-memory
     failures.

-Peff