On 1/17/24 1:18 PM, Bart Van Assche wrote:
> On 1/17/24 12:06, Jens Axboe wrote:
>> Case in point, I spent 10 min hacking up some smarts on the insertion
>> and dispatch side, and then we get:
>>
>> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>>
>> or about a 63% improvement when running the _exact same thing_. Looking
>> at profiles:
>>
>> - 13.71%  io_uring  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>
>> reducing the >70% of locking contention down to ~14%. No change in data
>> structures, just an ugly hack that:
>>
>> - Serializes dispatch, as there's no point having someone hammer on
>>   dd->lock for dispatch when it's already running
>> - Serializes insertions, punting to one of N buckets if insertion is
>>   already busy. The current insertion will notice that someone else did
>>   that, and will prune the buckets and re-run insertion.
>>
>> And while I seriously doubt that my quick hack is 100% foolproof, it
>> works as a proof of concept. If we can get that kind of reduction with
>> minimal effort, well...
>
> If nobody else beats me to it then I will look into using separate
> locks in the mq-deadline scheduler for insertion and dispatch.

That's not going to help by itself, as most of the contention (as I showed
in the profile trace in the email) is from dispatch competing with itself,
not necessarily dispatch competing with insertion. And I'm not sure how
that would even work, as insert and dispatch are working on the same
structures. Do some proper analysis first; that will show you where the
problem is.

-- 
Jens Axboe
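
For illustration only, below is a minimal userspace sketch of the two ideas
described in the quoted hack: a try-lock style guard so only one thread
dispatches at a time, and side buckets that insertions can punt into when
the main lock is busy, with the lock owner pruning them afterwards. This is
not the actual kernel change; it uses pthreads and C11 atomics instead of
dd->lock, and every name (sched_sketch, sketch_insert, ...) is made up.

/*
 * Sketch of "serialize dispatch, punt contended insertions to buckets".
 * Not the real mq-deadline hack; just the shape of the pattern.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_BUCKETS 8

struct request_stub {
	struct request_stub *next;
	int data;
};

struct sched_sketch {
	pthread_mutex_t lock;		/* stand-in for dd->lock */
	atomic_bool dispatching;	/* is someone already dispatching? */
	struct request_stub *queue;	/* main queue (LIFO here for brevity) */
	struct {
		pthread_mutex_t lock;
		struct request_stub *head;
	} bucket[NR_BUCKETS];
	atomic_bool bucket_pending;	/* did anyone punt to a bucket? */
};

static void sketch_init(struct sched_sketch *s)
{
	pthread_mutex_init(&s->lock, NULL);
	atomic_init(&s->dispatching, false);
	atomic_init(&s->bucket_pending, false);
	s->queue = NULL;
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_init(&s->bucket[i].lock, NULL);
		s->bucket[i].head = NULL;
	}
}

/* Insert one request: take the main lock if it is free, otherwise punt. */
static void sketch_insert(struct sched_sketch *s, struct request_stub *rq)
{
	if (pthread_mutex_trylock(&s->lock) != 0) {
		/* Main lock busy: stash in a bucket, let the owner prune it. */
		int i = rand() % NR_BUCKETS;	/* arbitrary bucket choice */

		pthread_mutex_lock(&s->bucket[i].lock);
		rq->next = s->bucket[i].head;
		s->bucket[i].head = rq;
		pthread_mutex_unlock(&s->bucket[i].lock);

		atomic_store(&s->bucket_pending, true);
		return;
	}

	/* We own the main lock: insert directly, then prune punted work. */
	rq->next = s->queue;
	s->queue = rq;

	if (atomic_exchange(&s->bucket_pending, false)) {
		for (int i = 0; i < NR_BUCKETS; i++) {
			pthread_mutex_lock(&s->bucket[i].lock);
			while (s->bucket[i].head) {
				struct request_stub *r = s->bucket[i].head;

				s->bucket[i].head = r->next;
				r->next = s->queue;
				s->queue = r;
			}
			pthread_mutex_unlock(&s->bucket[i].lock);
		}
	}
	pthread_mutex_unlock(&s->lock);
}

/* Dispatch: back off immediately if another thread is already dispatching. */
static struct request_stub *sketch_dispatch(struct sched_sketch *s)
{
	struct request_stub *rq;

	if (atomic_exchange(&s->dispatching, true))
		return NULL;	/* someone else is on it, don't contend */

	pthread_mutex_lock(&s->lock);
	rq = s->queue;
	if (rq)
		s->queue = rq->next;
	pthread_mutex_unlock(&s->lock);

	atomic_store(&s->dispatching, false);
	return rq;
}

The point of the pattern is that contended threads either back off entirely
(dispatch) or do a cheap side-store (insertion) instead of all piling onto
the one lock, which is where the queued_spin_lock_slowpath time in the
profile above was going.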