On 2/24/20 2:35 AM, Andres Freund wrote:
> Hi,
>
> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>> [description of buffered write workloads being slower via io_uring
>> than plain writes]
>> Because I'm working on other items, I didn't read carefully enough. Yes,
>> this won't change the situation for writes. I'll take a look at this when
>> I get time, maybe there's something we can do to improve the situation.
>
> I looked a bit into this.
>
> I think one issue is the spinning the workers do:
>
> static int io_wqe_worker(void *data)
> {
> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
> 		set_current_state(TASK_INTERRUPTIBLE);
> loop:
> 		if (did_work)
> 			io_worker_spin_for_work(wqe);
> 		spin_lock_irq(&wqe->lock);
> 		if (io_wqe_run_queue(wqe)) {
>
> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
> {
> 	int i = 0;
>
> 	while (++i < 1000) {
> 		if (io_wqe_run_queue(wqe))
> 			break;
> 		if (need_resched())
> 			break;
> 		cpu_relax();
> 	}
> }
>
> Even with the cpu_relax(), that causes quite a lot of cross-socket
> traffic, slowing down the submission side, which after all frequently
> needs to take the wqe->lock just to be able to submit a queue entry.
>
> lock, work_list, and flags all reside in one cacheline, so it's pretty
> likely that a single io_wqe_enqueue would get the cacheline "stolen"
> several times during one enqueue - without allowing any progress in the
> worker, of course.

Since it's provably harmful for this case, and the gain was small (but
noticeable) in single-issue cases, I think we should just kill it. With
the poll retry stuff for 5.7, there'll be even less of a need for it.
Care to send a patch for 5.6 to kill it?

> I also wonder if we can't avoid dequeuing entries one-by-one within the
> worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
> are just hitting the page cache, they're pretty fast, making it
> plausible that they cause pretty bad contention on the spinlock (even
> without the spinning above). Whereas the submission side is at least
> somewhat likely to be able to submit several queue entries while the
> worker is processing one job, the reverse is pretty unlikely for workers.
>
> In the hashed case there shouldn't be another worker processing entries
> for the same hash. So it seems quite possible for the worker to drain a
> few of the entries for that hash within one spinlock acquisition, and
> then process them one-by-one?

Yeah, I think that'd be a good optimization for hashed work. Work N+1
can't make any progress before work N is done anyway, so we might as
well grab a batch at a time.

-- 
Jens Axboe
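
[To make the "kill it" suggestion concrete: an untested sketch of the
removal, as a diff against only the io-wq code excerpted above. Hunk
offsets are unknown and left as "..." placeholders, and the context
lines come solely from the quoted excerpt, so a real 5.6 tree may
differ around them.]

	--- a/fs/io-wq.c
	+++ b/fs/io-wq.c
	@@ ... @@
	-static inline void io_worker_spin_for_work(struct io_wqe *wqe)
	-{
	-	int i = 0;
	-
	-	while (++i < 1000) {
	-		if (io_wqe_run_queue(wqe))
	-			break;
	-		if (need_resched())
	-			break;
	-		cpu_relax();
	-	}
	-}
	-
	@@ ... @@
	 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
	 		set_current_state(TASK_INTERRUPTIBLE);
	 loop:
	-		if (did_work)
	-			io_worker_spin_for_work(wqe);
	 		spin_lock_irq(&wqe->lock);

With the spin gone, the worker goes straight for wqe->lock once per
dequeue instead of hammering the lock's cacheline in a tight loop,
which is what was starving the submission side.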
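
[And a minimal, self-contained C sketch of the batched-dequeue idea:
claim a run of same-hash entries under a single lock acquisition, then
process them in order with the lock dropped. All names here (wq_entry,
wqe_lock, drain_hashed_batch) are hypothetical stand-ins, and a pthread
mutex stands in for the kernel spinlock; this illustrates the concept
only, not the actual io-wq implementation.]

	#include <pthread.h>
	#include <stddef.h>

	/* Hypothetical work item; entries sharing "hash" must not run
	 * concurrently (e.g. buffered writes to the same file). */
	struct wq_entry {
		struct wq_entry *next;
		unsigned int hash;
		void (*fn)(struct wq_entry *);
	};

	static struct wq_entry *wq_head;	/* singly-linked work list */
	static pthread_mutex_t wqe_lock = PTHREAD_MUTEX_INITIALIZER;

	/*
	 * Under one lock acquisition, detach the head entry plus any
	 * immediately following entries with the same hash, up to 'max'.
	 * No other worker processes this hash, so the whole run can be
	 * claimed at once.
	 */
	static struct wq_entry *drain_hashed_batch(int max)
	{
		struct wq_entry *batch, *tail;
		int n = 1;

		pthread_mutex_lock(&wqe_lock);
		batch = wq_head;
		if (!batch) {
			pthread_mutex_unlock(&wqe_lock);
			return NULL;
		}
		tail = batch;
		while (tail->next && tail->next->hash == batch->hash &&
		       n < max) {
			tail = tail->next;
			n++;
		}
		wq_head = tail->next;	/* unlink the whole run */
		tail->next = NULL;
		pthread_mutex_unlock(&wqe_lock);
		return batch;
	}

	/* Worker side: one lock round-trip per batch, not per entry. */
	static void worker_run_batch(void)
	{
		struct wq_entry *work = drain_hashed_batch(8);

		while (work) {
			struct wq_entry *next = work->next;

			work->fn(work);	/* in order: work N+1 runs after N */
			work = next;
		}
	}

Processing stays strictly in queue order, so the N+1-after-N dependency
is preserved, while the submission side now contends for the lock only
once per batch instead of once per dequeued entry.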