On Sun, Sep 26, 2021 at 06:41:27AM +0200, Willy Tarreau wrote:
> Hi Paul,
>
> On Sat, Sep 25, 2021 at 08:51:03PM -0700, Paul E. McKenney wrote:
> > Hello, Willy!
> >
> > Continuing from linkedin:
> >
> > > Maybe this doesn't work as well as expected because of the common L3 cache
> > > that runs at a single frequency and that imposes discrete timings. Also,
> > > I noticed that on modern CPUs, cache lines tend to "stick" at least a few
> > > cycles once they're in a cache, which helps the corresponding CPU chain
> > > a few atomic ops undisturbed. For example on an 8-core Ryzen I'm seeing a
> > > minimum of 8ns between two threads of the same core (L1 probably split in
> > > two halves), 25ns between two L2 and 60ns between the two halves (CCX)
> > > of the L3. This certainly makes it much harder to trigger concurrency
> > > issues. Well let's continue by e-mail, it's a real pain to type in this
> > > awful interface.
> >
> > Indeed, I get the best (worst?) results for memory latency on multi-socket
> > systems. And these results were not subtle:
> >
> >   https://paulmck.livejournal.com/62071.html
>
> Interesting. I've been working for a few months on trying to efficiently
> break compare-and-swap loops that tend to favor local nodes and to stall
> other ones. A bit of background first: in haproxy we have a 2-second
> watchdog that kills the process if a thread is stuck that long (that's
> an eternity in an event-driven program). We've recently got reports of
> the watchdog triggering on EPYC systems with blocks of 16 consecutive
> threads unable to make any progress. It turns out that these 16
> consecutive threads always seem to be in the same core-complex, i.e.
> the same L3, so it appears that it is sometimes hard to reclaim a line
> that's stressed by other nodes.
>
> I wrote a program to measure inter-CPU atomic latencies and the
> success/failure rate of a CAS for each thread combination. Even on a
> Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
> failures. When trying to modify my code to try to enforce some fairness,
> I managed to get down to fewer than 4 failures. I was happy until I
> discovered a bug in my program that made it do nothing except check the
> shared variable I'm using to enforce the fairness. So it seems that
> interleaving multiple independent accesses might sometimes help provide
> some fairness in all of this, which is what I'm going to work on today.

The textbook approach is to partition the algorithm, for example, with
a combining tree or similar.  Otherwise, you are quite right, there are
all sorts of hardware-specific things that can happen at high levels of
memory contention.

Other than that, there are a lot of heuristics involving per-node
counters, so that if a given node has had more than N consecutive shots
and some other node wants in, it holds off until the other node gets in.
Complex and tricky, so partitioning is better where it can be done.  (A
rough sketch of the counter-based approach appears below.)

> > All that aside, any advice on portably and usefully getting 2-3x clock
> > frequency differences into testing would be quite welcome.
>
> I think that playing with scaling_max_freq and randomly setting it
> to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> However, it will likely not work inside KVM, but for bare-metal tests
> it should work where cpufreq is supported.

OK, that explains at least some of my difficulties.  I almost always
run under qemu/KVM.
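
Before getting back to the frequency question, here is a minimal sketch
of that per-node counter heuristic, using C11 atomics.  It is purely
illustrative and not taken from haproxy or the kernel: NR_NODES,
MAX_STREAK, fair_add(), and cpu_relax() are made-up names, and the
winner/streak bookkeeping is deliberately approximate (it is updated
racily, which is good enough for a heuristic).

/*
 * Illustrative sketch only, not haproxy or kernel code: a node that
 * has won the CAS too many times in a row yields while another node
 * has threads waiting.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define NR_NODES   2
#define MAX_STREAK 8	/* arbitrary cap on back-to-back wins by one node */

static _Atomic unsigned long shared_val;
static _Atomic int last_winner = -1;	/* node that last won the CAS */
static _Atomic int streak;		/* consecutive wins by last_winner */
static _Atomic int waiters[NR_NODES];	/* threads per node wanting in */

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

static bool other_node_waiting(int node)
{
	for (int n = 0; n < NR_NODES; n++)
		if (n != node && atomic_load(&waiters[n]))
			return true;
	return false;
}

/* Add @delta to shared_val, yielding if this node has hogged the line. */
void fair_add(int node, unsigned long delta)
{
	atomic_fetch_add(&waiters[node], 1);
	for (;;) {
		/* Back off while this node is hogging and others want in. */
		if (atomic_load(&last_winner) == node &&
		    atomic_load(&streak) >= MAX_STREAK &&
		    other_node_waiting(node)) {
			cpu_relax();
			continue;
		}

		unsigned long old = atomic_load(&shared_val);
		if (atomic_compare_exchange_weak(&shared_val, &old, old + delta)) {
			/* Approximate per-node win tracking. */
			if (atomic_exchange(&last_winner, node) == node)
				atomic_fetch_add(&streak, 1);
			else
				atomic_store(&streak, 1);
			break;
		}
		cpu_relax();
	}
	atomic_fetch_sub(&waiters[node], 1);
}

The shape is the point: once one node has won MAX_STREAK CASes in a row
while another node has waiters, its threads back off until some other
node wins and resets the streak.  A real implementation would also want
to keep the bookkeeping off the contended cache line and think harder
about memory ordering.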
> Something like the untested below could be a good start, and maybe it
> could even be periodically changed while the test is running:
>
>   for p in /sys/devices/system/cpu/cpufreq/policy*; do
>       min=$(cat $p/cpuinfo_min_freq)
>       max=$(cat $p/cpuinfo_max_freq)
>       if ((RANDOM & 1)); then
>           echo $max > $p/scaling_max_freq
>       else
>           echo $min > $p/scaling_max_freq
>       fi
>   done
>
> Then you can undo it like this:
>
>   for p in /sys/devices/system/cpu/cpufreq/policy*; do
>       cat $p/cpuinfo_max_freq > $p/scaling_max_freq
>   done
>
> On my i7-8650U laptop, the min/max freqs are 400/4200 MHz. On the previous
> one (i5-3320M) it's 1200/3300. On a Ryzen 2700X it's 2200/4000. I'm seeing
> 800/5000 on a core i9-9900K. So I guess that it could be a good start. If
> you need the ratios to be kept tighter, we could probably improve the
> script above to try intermediate values.

But yes, on bare-metal setups, those ratios should be able to do some
useful damage.  ;-)  Thank you!

							Thanx, Paul
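
As an illustration of the kind of pairwise measurement discussed
earlier, and something that could be run while the script above
randomizes scaling_max_freq, here is a rough, self-contained sketch.
It is not Willy's actual tool; the CPU arguments, iteration count, and
cache-line alignment are arbitrary choices.

/*
 * Rough sketch, not Willy's actual tool: two threads pinned to chosen
 * CPUs increment one shared variable with CAS and count how often the
 * CAS fails, plus the longest run of consecutive failures.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000UL			/* arbitrary iteration count */

/* Keep the contended variable on its own cache line. */
static _Alignas(64) _Atomic unsigned long shared;

struct worker {
	int cpu;
	unsigned long failures;
	unsigned long max_streak;	/* longest run of consecutive failures */
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(w->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (unsigned long i = 0; i < ITERS; i++) {
		unsigned long streak = 0;
		unsigned long old = atomic_load(&shared);

		/* On failure, "old" is refreshed and we try again. */
		while (!atomic_compare_exchange_weak(&shared, &old, old + 1)) {
			w->failures++;
			if (++streak > w->max_streak)
				w->max_streak = streak;
		}
	}
	return NULL;
}

int main(int argc, char **argv)
{
	/* CPU numbers are arbitrary; pick two in different CCXes or nodes. */
	struct worker w[2] = {
		{ .cpu = argc > 1 ? atoi(argv[1]) : 0 },
		{ .cpu = argc > 2 ? atoi(argv[2]) : 1 },
	};
	pthread_t tid[2];

	for (int i = 0; i < 2; i++)
		pthread_create(&tid[i], NULL, worker_fn, &w[i]);
	for (int i = 0; i < 2; i++)
		pthread_join(tid[i], NULL);

	for (int i = 0; i < 2; i++)
		printf("cpu %d: %lu CAS failures, longest streak %lu\n",
		       w[i].cpu, w[i].failures, w[i].max_streak);
	return 0;
}

Build with something like "gcc -O2 -pthread cas-pair.c -o cas-pair" and
run it as "./cas-pair 0 8" (or whatever pair of CPUs spans the cores,
CCXes, or sockets of interest); the failure counts and longest failure
streaks give a crude view of how unfairly the cache line moves between
the two CPUs.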