Re: mempool and cacheline ping pong

Loïc Dachary <loic@xxxxxxxxxxx> · Mon, 5 Apr 2021 16:38:30 +0200



[snip]
>
> Hi Loïc:
> Your new sharding version looks much better.  I do not see any cacheline contention at all. 
>
> Here's some insight into the difference.
> The atomic update you're doing to the lock-variable has to both read and write the lock-variable and does it while it has the cacheline for the lock-variable locked.  
>
> The cpu doing that atomic instruction needs to first get ownership of the cacheline.  If no other threads of execution are also trying to get ownership of that cacheline, then ownership is granted rather quickly. 
> If, however, there are many other threads of execution trying to get ownership of that cacheline, then all those cpus must "get in line" and wait their turn.
>
And if there are N CPUs and N*2 threads, the only way to avoid cacheline contention is if all of these threads use a different (cacheline aligned) variable. Is that correct? If it is, I wonder whether cacheline contention has a linear impact on performance. If there is cacheline contention on 1 variable with N threads on N CPUs, will the performance degradation be the same if there is cacheline contention on 10 variables with the same number of threads and CPUs? Or will it get worse because there is some sort of amplification?
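To make the question concrete, here is a minimal sketch of what I mean by "a different (cacheline aligned) variable". It is not the mempool code: the names (Shard, kShards, kCacheLine, run_sharded) are made up for illustration, and the 64-byte line size is an assumption that holds on typical x86-64 machines. Each thread picks a shard by its index, so with at least as many shards as running threads no two threads update the same cacheline.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical on x86-64); adjust for other CPUs.
constexpr std::size_t kCacheLine = 64;
// Hypothetical shard count, chosen only for this sketch.
constexpr std::size_t kShards = 8;

// One counter per shard. alignas rounds the size of each shard up to a full
// cache line, so neighbouring shards never share a line.
struct alignas(kCacheLine) Shard {
    std::atomic<long> value{0};
};

// Starts `threads` threads; thread t adds 1 to shard (t % kShards) `iters`
// times. Returns the sum over all shards once every thread has finished.
long run_sharded(std::size_t threads, long iters) {
    std::array<Shard, kShards> shards;  // automatic storage honours alignas
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&shards, t, iters] {
            Shard& mine = shards[t % kShards];
            for (long i = 0; i < iters; ++i)
                mine.value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();

    long total = 0;
    for (const auto& s : shards)
        total += s.value.load(std::memory_order_relaxed);
    return total;
}
```

Calling run_sharded(4, 10000) uses four distinct shards. With more threads than kShards, some threads share a shard again; the total stays correct because the increment is atomic, but those threads contend for the same cacheline.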
> In your "without-sharding" case, the average number of machine cycles needed for the atomic instruction to gain ownership of the cacheline was 751 machine cycles.  In the "with-sharding" case, it dropped to 84 machine cycles.
>
> Understand, however, the above numbers are not perfectly accurate. That's because perf was instructed to ignore any load instructions that completed in fewer than 70 machine cycles. The reasoning is that at those low cycle counts there is no contention, so why burden the perf tool execution and data collection with the extra processing of fast loads that aren't relevant to finding cacheline contention? I mention this for completeness. The drop from 751 to 84 machine cycles is significant.
>
Thanks for the crystal clear explanation. I'd like the test script to extract those numbers from the files produced by perf c2c. How do you suggest I go about this?
>
> Do you have something in your code to guarantee that nothing else resides in the same aligned "128 byte 2-cacheline block" as your locks?
>
These 128 bytes are not used; they are just padding to make sure nothing else is stored there.
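For illustration only, this is roughly how such a guarantee can be expressed and checked at compile time. It is a sketch under assumptions, not the definition in mempool.h: PaddedLock, kBlock, block_of and locks_are_isolated are hypothetical names, and 128 bytes is taken from the "128 byte 2-cacheline block" in the question above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: one block is two adjacent 64-byte cache lines, i.e. 128 bytes.
constexpr std::size_t kBlock = 128;

// Hypothetical lock-holding type. alignas both aligns the object on a
// 128-byte boundary and rounds sizeof up to 128, so the rest of the block is
// unused padding and nothing else can be placed in it.
struct alignas(kBlock) PaddedLock {
    std::atomic<int> lock{0};
};

static_assert(alignof(PaddedLock) == kBlock, "lock must start a 128-byte block");
static_assert(sizeof(PaddedLock) == kBlock, "lock must fill the whole block");

// Index of the 128-byte block that contains address p.
inline std::uintptr_t block_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kBlock;
}

// Returns true if no two neighbouring locks in the array share a block.
inline bool locks_are_isolated(const PaddedLock* locks, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        if (block_of(&locks[i - 1].lock) == block_of(&locks[i].lock))
            return false;
    }
    return true;
}
```

The static_assert lines fail the build if someone later adds a member that pushes the struct past 128 bytes or removes the alignment, which is the kind of guarantee the question asks about.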

Cheers

[0] https://lab.fedeproxy.eu/ceph/ceph/-/blob/wip-mempool-cacheline-49781/src/include/mempool.h#L195-203
