From: Dave Hansen
> Sent: 11 April 2023 14:44
>
> On 4/11/23 04:35, Mark Rutland wrote:
> > I agree it'd be nice to have performance figures, but I think those would only
> > need to demonstrate a lack of a regression rather than a performance
> > improvement, and I think it's fairly clear from eyeballing the generated
> > instructions that a regression isn't likely.
>
> Thanks for the additional context.
>
> I totally agree that there's zero burden here to show a performance
> increase.  If anyone can think of a quick way to do _some_ kind of
> benchmark on the code being changed and just show that it's free of
> brown paper bags, it would be appreciated.  Nothing crazy, just think of
> one workload (synthetic or not) that will stress the paths being changed
> and run it with and without these changes.  Make sure there are not
> surprises.
>
> I also agree that it's unlikely to be brown paper bag material.

The only thing I can think of is that, on x86, the locked variant
may actually be faster!
Both require exclusive access to the cache line (the unlocked
variant always does the write! [1]).
So if the cache line is contended between CPUs, the unlocked variant
might ping-pong the cache line twice!
Of course, if the line is shared like that then performance
is horrid.

[1] I checked on an uncached PCIe address on which I can monitor
the TLPs. The write always happens, so you can use cmpxchg16b with
a 'known bad value' to do a 16 byte read as a single TLP
(without using an SSE register).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
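
For illustration only, the 16 byte read trick in [1] might be sketched in
userspace C roughly as below. This is a sketch under assumptions, not code
from the patch series: `struct u128` and `read16()` are hypothetical names,
it uses GCC/Clang inline asm, it operates on ordinary cached memory rather
than the uncached PCIe mapping described above, and it uses the locked form
so that the read is atomic against other CPUs. The non-x86 branch is a
plain (non-atomic) copy that exists only so the sketch compiles elsewhere.

```c
#include <stdint.h>

/* 16 byte aligned pair; cmpxchg16b faults on a misaligned operand. */
struct u128 {
	uint64_t lo;
	uint64_t hi;
} __attribute__((aligned(16)));

/*
 * Hypothetical helper: read 16 bytes at *p with a single cmpxchg16b.
 *
 * 'guess' is both the expected and the replacement value:
 *  - if the compare fails, RDX:RAX is loaded with the memory contents;
 *  - if it happens to match, the same value is stored back, so memory
 *    is unchanged and 'guess' already equals the contents.
 * Passing a 'known bad value' just makes the first case the usual one.
 *
 * The destination gets a write cycle either way, so *p must be writable.
 */
static struct u128 read16(struct u128 *p, struct u128 guess)
{
	struct u128 out = guess;
#if defined(__x86_64__)
	__asm__ __volatile__(
		"lock cmpxchg16b %0"
		: "+m" (*p), "+a" (out.lo), "+d" (out.hi)
		: "b" (guess.lo), "c" (guess.hi)
		: "cc", "memory");
#else
	out = *p;	/* non-atomic fallback, compile-only convenience */
#endif
	return out;
}
```

A caller would do something like `struct u128 v = read16(&reg, bad);` where
`bad` holds a value the location is never expected to contain, for example
all ones.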