Hi,

On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
> [description of buffered write workloads being slower via io_uring
> than plain writes]
> Because I'm working on other items, I didn't read carefully enough. Yes
> this won't change the situation for writes. I'll take a look at this when
> I get time, maybe there's something we can do to improve the situation.

I looked a bit into this. I think one issue is the spinning the workers
do:

static int io_wqe_worker(void *data)
{
	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
		set_current_state(TASK_INTERRUPTIBLE);
loop:
		if (did_work)
			io_worker_spin_for_work(wqe);
		spin_lock_irq(&wqe->lock);
		if (io_wqe_run_queue(wqe)) {

and

static inline void io_worker_spin_for_work(struct io_wqe *wqe)
{
	int i = 0;

	while (++i < 1000) {
		if (io_wqe_run_queue(wqe))
			break;
		if (need_resched())
			break;
		cpu_relax();
	}
}

Even with the cpu_relax(), that causes quite a lot of cross-socket
traffic, slowing down the submission side - which, after all, frequently
needs to take the wqe->lock just to be able to submit a queue entry.
lock, work_list and flags all reside in one cacheline, so it's pretty
likely that a single io_wqe_enqueue() has the cacheline "stolen" several
times by a spinning worker during one enqueue - without allowing any
progress in the worker, of course. A rough sketch of one way around that
is at the end of this mail.

I also wonder if we can't avoid dequeueing entries one-by-one within the
worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
just hit the page cache, they complete quickly, which makes pretty bad
contention on the spinlock plausible (even without the spinning above).
Whereas the submission side is at least somewhat likely to be able to
submit several queue entries while the worker is processing one job,
that's pretty unlikely for workers, which take the lock for every single
entry.

In the hashed case there shouldn't be another worker processing entries
for the same hash. So it seems quite possible for the worker to drain a
few of the entries for that hash within one spinlock acquisition, and
then process them one-by-one? A sketch of that is appended below as
well.
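
To illustrate the first point, here's a very rough, untested sketch of
one way to keep a spinning worker off the submission side's cacheline:
spin on a dedicated sequence counter that lives on its own cacheline and
is bumped once per enqueue, instead of polling io_wqe_run_queue(), which
reads the line shared with wqe->lock. The "work_seq" field and its
placement are invented; the rest only loosely follows io-wq.c:

/*
 * Rough, untested sketch - not a patch. Give the wqe a dedicated
 * sequence counter on its own cacheline, bumped once per enqueue,
 * and let io_worker_spin_for_work() poll that instead of
 * io_wqe_run_queue(). The spinning worker then only reads a line
 * that's written once per submission, rather than hammering the
 * line containing wqe->lock/work_list/flags while io_wqe_enqueue()
 * is trying to make progress.
 */
struct io_wqe {
	/* existing hot fields the submission side writes under the lock */
	spinlock_t lock;
	struct io_wq_work_list work_list;
	unsigned flags;

	/* invented: on its own line, so spinners don't disturb the above */
	atomic_t work_seq ____cacheline_aligned_in_smp;

	/* ... rest of the struct ... */
};

static inline void io_worker_spin_for_work(struct io_wqe *wqe)
{
	int seq = atomic_read(&wqe->work_seq);
	int i = 0;

	while (++i < 1000) {
		/* read-only poll of a line nobody is actively writing */
		if (atomic_read(&wqe->work_seq) != seq)
			break;
		if (need_resched())
			break;
		cpu_relax();
	}
}

io_wqe_enqueue() would then do an atomic_inc(&wqe->work_seq) after
dropping wqe->lock. The spin stays a heuristic either way - the worker
still takes the lock and re-checks the queue afterwards, as it does
today.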
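
And for the second point, an equally rough sketch of draining a hash's
entries in one lock acquisition. The structures are simplified down to a
bare ->next pointer and an invented ->hash field, so treat the names as
stand-ins for the real io-wq internals; the point is just the shape of
it:

/*
 * One spinlock acquisition moves every queued entry for the hash
 * onto a local list; the entries then run without wqe->lock held.
 * Since no other worker may process entries for this hash, nobody
 * can race us on ordering within the hash.
 */
static void io_worker_handle_hashed(struct io_wqe *wqe, unsigned int hash)
{
	struct io_wq_work *batch = NULL, **tail = &batch;
	struct io_wq_work *work, **pp;

	spin_lock_irq(&wqe->lock);
	pp = &wqe->work_list;
	while ((work = *pp) != NULL) {
		if (work->hash == hash) {
			*pp = work->next;	/* unlink from the wqe list */
			work->next = NULL;
			*tail = work;		/* append to the local batch */
			tail = &work->next;
		} else {
			pp = &work->next;
		}
	}
	spin_unlock_irq(&wqe->lock);

	/* process the batch one-by-one, without re-taking wqe->lock */
	while ((work = batch) != NULL) {
		batch = work->next;
		work->func(&work);
	}
}

Greetings,

Andres Freund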