Hi Paul,

On Sun, Sep 26, 2021 at 08:25:17PM -0700, Paul E. McKenney wrote:
> > I wrote a program to measure inter-CPU atomic latencies and
> > success/failure rates to perform a CAS depending on each thread
> > combination. Even on a Ryzen 2700X (8 cores in 2*4 blocks) I've
> > measured up to ~4k consecutive failures. When trying to modify my
> > code to try to enforce some fairness, I managed to go down to less
> > than 4 failures. I was happy until I discovered that there was a bug
> > in my program making it not do anything except check the shared
> > variable that I'm using to enforce the fairness. So it seems that
> > interleaving multiple independent accesses might sometimes help
> > provide some fairness in all of this, which is what I'm going to
> > work on today.
>
> The textbook approach is to partition the algorithm, for example, with
> a combining tree or similar.

That's what I'm working on for the long term. For the short term I'm
trying to measure the average time spent trying and failing, and to set
a threshold above which other threads do not start to work and let the
others finish first. This gives very good results, with peak CAS
failures going down from ~300 to ~16 in my tests, and the average time
being even lower than before. However it's a bit addictive, and playing
a bit too much with tuning tends to favor some archs at the expense of
others.

My experiments are here for the curious, though probably not very
interesting at the moment: https://github.com/wtarreau/atomic-tests
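
In case that description is too terse, the principle is roughly the
following (a simplified sketch with made-up names, constants and
averaging, not the actual code from the repository):

/*
 * Each thread measures the time it wastes failing the CAS and feeds it
 * into a shared rolling average; threads do not even start competing
 * while that average is above a threshold, letting current owners
 * finish first.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define FAIL_NS_THRESHOLD 2000          /* arbitrary, to be tuned per arch */

static _Atomic uint64_t shared_val;     /* the contended variable */
static _Atomic uint64_t avg_fail_ns;    /* rolling avg of time lost failing */

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

void fair_add(uint64_t delta)
{
        uint64_t start, old, avg;

        /* hold off while others report expensive failures */
        while (atomic_load_explicit(&avg_fail_ns, memory_order_relaxed) >
               FAIL_NS_THRESHOLD)
                ;       /* ideally a cpu_relax()/pause here */

        start = now_ns();
        old = atomic_load_explicit(&shared_val, memory_order_relaxed);
        while (!atomic_compare_exchange_weak(&shared_val, &old, old + delta)) {
                /* failure: <old> was refreshed; account the wasted time */
                avg = atomic_load_explicit(&avg_fail_ns, memory_order_relaxed);
                atomic_store_explicit(&avg_fail_ns,
                                      (7 * avg + (now_ns() - start)) / 8,
                                      memory_order_relaxed);
        }

        /* success: decay the average so that waiters come back */
        avg = atomic_load_explicit(&avg_fail_ns, memory_order_relaxed);
        atomic_store_explicit(&avg_fail_ns, avg - avg / 8,
                              memory_order_relaxed);
}
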
> Otherwise, you are quite right, there are all sorts of
> hardware-specific things that can happen at high levels of memory
> contention.

There's not just memory contention; I found some severe contention
inside the L3 caches themselves. For example, AMD Ryzen CPUs have their
caches arranged as "core complexes" of 4 cores connected to the same
block of L3 cache, itself connected to the other "core complexes". Look
at the measured inter-CPU latency (in nanoseconds) for a
compare-and-swap between each pair of them:

        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   0    -   8  24  24  25  24  26  26  69  68  68  69  69  69  69  69
   1    8   -  24  24  24  24  26  26  69  69  69  69  69  69  69  69
   2   24  24   -   8  23  23  24  24  69  69  68  68  69  69  68  69
   3   24  24   8   -  23  23  24  24  69  68  68  69  69  69  68  71
   4   25  25  23  23   -   8  25  25  69  69  69  69  70  70  69  70
   5   25  24  23  23   8   -  25  25  69  69  70  69  70  69  70  69
   6   26  26  24  24  25  25   -   8  69  69  69  69  69  69  69  69
   7   26  26  24  24  25  25   8   -  68  69  69  70  70  70  70  69
   8   69  69  68  68  69  69  69  69   -   8  24  24  25  25  26  26
   9   69  69  68  68  69  69  69  69   8   -  24  24  25  25  26  26
  10   69  69  69  69  69  70  69  68  24  24   -   8  23  23  24  24
  11   69  69  69  67  68  69  68  68  24  24   8   -  23  23  24  24
  12   69  69  69  68  70  70  69  69  25  25  23  23   -   8  25  25
  13   69  69  69  68  70  70  69  69  25  24  23  23   8   -  25  25
  14   69  69  68  69  70  69  69  69  26  26  24  24  25  25   -   8
  15   69  69  68  68  70  70  69  69  26  26  24  24  25  25   8   -

And if you perform a load before the CAS (like for example when
computing a max or updating a pointer), it gets better within the same
core complex and much worse across them:

        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   0    -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
   1   18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
   2   21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
   3   21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
   4   24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
   5   24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
   6   22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
   7   22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
   8  123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
   9  124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
  10  123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
  11  123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
  12  125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
  13  126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
  14  124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
  15  124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
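
For reference, the two access patterns being timed in these tables are
essentially the following (a reduced sketch, not the exact measurement
loop; the comment on the second one reflects my interpretation of the
numbers):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t line;   /* the cache line bouncing between CPUs */

/* pattern of the first table: the CAS alone */
int cas_only(uint64_t expected, uint64_t new)
{
        return atomic_compare_exchange_strong(&line, &expected, new);
}

/* pattern of the second table: load, then CAS, as when updating a max
 * or a pointer. The load first fetches the line for reading, and the
 * CAS must then obtain it exclusively, which presumably explains the
 * much higher cross-complex cost.
 */
int load_then_cas(void)
{
        uint64_t expected = atomic_load_explicit(&line, memory_order_relaxed);

        return atomic_compare_exchange_strong(&line, &expected, expected + 1);
}
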
See? There's basically a factor of 6 in latency between cores that are
all supposed to be connected to the same L3. And looking closer you
even observe that within the same core complex, things are not uniform
at all either:

  - core 1 always experiences the fastest communication with other cores
  - core 2 always experiences the slowest communication with other cores
  - discrepancies between pairs vary by up to 50%

However when it comes to success/failure rates of a CAS, it's the
opposite:

  - after a load, core 2 almost always succeeds a CAS (~50 tries)
  - after a load, core 1 almost always fails a CAS (~500 tries)
  - after a load, cores 0 and 3 are average (~100 tries)

This indicates to me that there's probably some arbitration cost for
the cores that access the cache, that cache lines assigned to one core
or the other take more time to be reclaimed from some cores than from
others, and that the cores experiencing the highest latencies are also
the ones keeping the lines the longest, hence observing the highest
success rate on a CAS.

In your case it could mean that certain races will never happen on
certain cores but could on other ones, or at least above certain
frequencies, because if the cache line sticks long enough in an L2 to
perform a series of operations undisturbed, whatever you do on other
cores might have no effect. I suspect that working at much lower
frequencies could at some point allow a cache line to be released
between two operations.

I don't seem to be observing this on the AMD for now though, as the
differences increase when lowering the frequency from 4.0 to 2.2 GHz
for a single CAS:

        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   0    -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
   1   19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
   2   55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
   3   55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
   4   48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
   5   48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
   6   62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
   7   62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
   8  103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
   9  103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
  10  105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
  11  105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
  12  103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
  13  103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
  14  103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
  15  103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -

Where the same core complex would previously see variations from 22 to
26 ns, we're now seeing 40 to 62!

> Other than that, there are a lot of heuristics involving per-node
> counters so that if a given node has had more than N consecutive shots
> and some other node wants in, it holds off until the other node gets
> in. Complex and tricky, so partitioning is better where it can be done.

I tried a bit with this but it didn't yield very interesting results
for now. I think the reason is that I'm focusing on max latency, and
with this approach the max latency experienced by any CPU becomes the
number of CPUs times the max granted to a single CPU. Playing with
roughly averaged wait times instead tends to favor the threads that
have been waiting the longest first. The only thing is that I try to
avoid updating this measurement when the wait time is low enough, in
order to avoid writes to a shared area.

One of the side effects of writing to such a shared area is that some
operations may be delayed due to write ordering. I.e. writing that
you're waiting for the access, then performing a CAS, will first
require that the line containing that "waiting" flag be flushed before
trying the CAS, which may increase the risk that another CPU updates
the value in the meantime and causes the CAS to fail. And if the
"waiting" flag is shared with the area being touched by the CAS, then
touching it to say "I'm here" is enough to slow down the other ones.
Thus I prefer to watch a shared entry that is not supposed to be
updated often, except when CPUs are fed up with waiting. Under normal
operation it should perform better.
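
In code, that principle looks roughly like this (a sketch with made-up
names and an arbitrary threshold):

/*
 * Waiters only write to the shared word once they are genuinely fed
 * up, so under normal operation nobody dirties that line before the
 * CAS.
 */
#include <stdatomic.h>
#include <stdint.h>

#define LONG_WAIT_NS 10000              /* arbitrary "fed up" threshold */

static _Atomic uint64_t shared_val;
static _Atomic uint32_t pressure;       /* almost never written */

/* returns non-zero on success; <waited_ns> is maintained by the caller */
int try_update(uint64_t new, uint64_t waited_ns)
{
        uint64_t exp;

        if (atomic_load_explicit(&pressure, memory_order_relaxed) == 0) {
                /* fast path: nothing was written before the CAS, so no
                 * prior store needs to be flushed out first and the
                 * line is not disturbed for the other CPUs.
                 */
                exp = atomic_load_explicit(&shared_val, memory_order_relaxed);
                if (atomic_compare_exchange_strong(&shared_val, &exp, new))
                        return 1;
        }

        if (waited_ns >= LONG_WAIT_NS) {
                /* slow path: we've waited long enough, announce it so
                 * that new entrants hold off, then insist.
                 */
                atomic_fetch_add(&pressure, 1);
                exp = atomic_load_explicit(&shared_val, memory_order_relaxed);
                while (!atomic_compare_exchange_weak(&shared_val, &exp, new))
                        ;       /* exp is refreshed on each failure */
                atomic_fetch_sub(&pressure, 1);
                return 1;
        }
        return 0;
}
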
Another point I'm quite interested in is trying to group operations
from close neighbors. In the example above on the Ryzen CPU, I would
love to see CPUs from the same core complex try their luck together.
One could think about using the CPU number to distribute accesses so
that they're grouped, but it's less convenient from userland (or even
from a VM). One element, however, knows better than the software: the
cache controller. That's why I'm trying to keep *some* contention, so
that once a core gets its access to the cache, the other ones on the
same side of the L3 have a better chance to access it immediately
after. And I think that's how I've reached an average of 45 ns per
operation (stats included) in my last test across all CPUs, despite
inter-CPU times ranging from 18 to 125 ns. But that is quite unstable
for now.

> > > All that aside, any advice on portably and usefully getting 2-3x
> > > clock frequency differences into testing would be quite welcome.
> >
> > I think that playing with scaling_max_freq and randomly setting it
> > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > However it will likely not work inside KVM, but for bare-metal tests
> > it should work where cpufreq is supported.
>
> OK, that explains at least some of my difficulties. I almost always
> run under qemu/KVM.

But you could still disturb the real machine by changing its frequency.
I've been wondering if it's possible to change the tick rate inside
KVM, or even to program a dummy device to deliver very fast IRQs, but I
haven't found such a thing for now. I've been thinking about NMIs as
well. The idea would be that there could be a higher chance of
interrupting processing at any random place using fast interrupts
(>100k/s) than by relying on context switches caused by scheduler slots
being depleted. Maybe one only needs to hack an existing device
emulator to flood an IRQ :-/
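
And for completeness regarding the bare-metal case above, the
scaling_max_freq flipping can be as trivial as the sketch below
(assuming the standard cpufreq sysfs files; needs root, and the dwell
times are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void read_freq(const char *cpu, const char *file, char *buf, int len)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%s/cpufreq/%s", cpu, file);
        f = fopen(path, "r");
        if (!f || !fgets(buf, len, f))
                exit(1);
        fclose(f);
}

int main(int argc, char **argv)
{
        const char *cpu = argc > 1 ? argv[1] : "0";
        char lo[32], hi[32], path[128];
        FILE *f;

        read_freq(cpu, "cpuinfo_min_freq", lo, sizeof(lo));
        read_freq(cpu, "cpuinfo_max_freq", hi, sizeof(hi));
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%s/cpufreq/scaling_max_freq", cpu);
        srand(getpid());

        while (1) {
                f = fopen(path, "w");
                if (!f)
                        exit(1);
                fputs(rand() & 1 ? hi : lo, f);
                fclose(f);
                usleep(10000 + rand() % 100000);  /* hold 10..110 ms */
        }
}

Cheers,
Willy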