On Thu, 26 Oct 2023 14:36:36 -0400 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote: > > static inline unsigned long > > xchg(volatile unsigned *ptr, unsigned new) > > { > > unsigned ret = new; > > > > asm volatile("xchg %b0,%1" > > which has an implicit lock prefix (xchg with a memory operand is a > special-case): > > Quoting Intel manual: > > "If a memory operand is referenced, the processor’s locking protocol is > automatically implemented for the duration of the exchange operation, > regardless of the presence or absence of the LOCK prefix or of the value > of the IOPL. (See the LOCK prefix description in this chapter for more > information on the locking protocol.)" > Bah. I learn something new every day :-p I thought xchg required specifying lock too. This means that cmpxchg is actually faster than xchg! It's been a long time since I had written assembly. What makes this interesting though, is that when I ran my tests originally, when my change was disabled, the xchg() code never executed, and it still showed a significant improvement. That is, even with adding xchg(), it still improved much more. But that's probably because the xchg() happens after releasing the lock and after that it goes into a loop waiting for another thread to make the update, which requires a lock cmpxchg(). Hence, the xchg() slowdown wasn't in a critical path of the test. Anyway, I changed the code to use: static inline unsigned clrbit(volatile unsigned *ptr) { unsigned ret; asm volatile("andb %b1,%0" : "+m" (*(volatile char *)ptr) : "iq" (0x2) : "memory"); ret = *ptr; *ptr = 0; return ret; } I just used andb as that's a locally atomic operation. I could also use bts (as Mathieu suggested to me on IRC). It doesn't matter as this is just example code and is not in production. How this is implemented is not an important part of the algorithm. Just that it can atomically clear the bit it sets without a race with the kernel setting the bit it sets. I could modify the code to put these bits in separate bytes as well. That's just an implementation detail we can work on later. I updated my test to have: static bool no_rseq; static void init_extend_map(void) { int ret; if (no_rseq) return; ret = rseq(&rseq_map, sizeof(rseq_map), 0, 0); perror("rseq"); printf("ret = %d (%zd) %p\n", ret, sizeof(rseq_map), &rseq_map); } static inline unsigned clrbit(volatile unsigned *ptr) { unsigned ret; asm volatile("andb %b1,%0" : "+m" (*(volatile char *)ptr) : "iq" (0x2) : "memory"); ret = *ptr; *ptr = 0; return ret; } static void extend(void) { if (no_rseq) return; rseq_map.cr_flags = 1; } static void unextend(void) { unsigned prev; if (no_rseq) return; prev = clrbit(&rseq_map.cr_flags); if (prev & 2) { tracefs_printf(NULL, "Yield!\n"); sched_yield(); } } where the tests with it disabled do not run the extend or unextend() code (as it makes sense to only perform that code with this feature, as that would be part of the overhead to implement it). Here's the results: With no_rseq = true, with 5 runs, which were basically all the same, but the best run was: Ran for 4260797 times Total wait time: 33.905306 And with no_rseq = false; Most runs were like: Ran for 5378845 times Total wait time: 32.253018 But the worse run was: Ran for 4991411 times Total wait time: 31.745662 But with that lower "wait time" it probably was preempted by a something else during that run, as the lower the "wait time" the better the result. -- Steve