On 24/02/2020 18:22, Jens Axboe wrote:
> On 2/24/20 2:35 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>>> [description of buffered write workloads being slower via io_uring
>>> than plain writes]
>>> Because I'm working on other items, I didn't read carefully enough. Yes,
>>> this won't change the situation for writes. I'll take a look at this when
>>> I get time, maybe there's something we can do to improve the situation.
>>
>> I looked a bit into this.
>>
>> I think one issue is the spinning the workers do:
>>
>> static int io_wqe_worker(void *data)
>> {
>> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>> 		set_current_state(TASK_INTERRUPTIBLE);
>> loop:
>> 		if (did_work)
>> 			io_worker_spin_for_work(wqe);
>> 		spin_lock_irq(&wqe->lock);
>> 		if (io_wqe_run_queue(wqe)) {
>>
>> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
>> {
>> 	int i = 0;
>>
>> 	while (++i < 1000) {
>> 		if (io_wqe_run_queue(wqe))
>> 			break;
>> 		if (need_resched())
>> 			break;
>> 		cpu_relax();
>> 	}
>> }
>>
>> Even with the cpu_relax(), that causes quite a lot of cross-socket
>> traffic, slowing down the submission side, which after all frequently
>> needs to take the wqe->lock just to be able to submit a queue entry.
>>
>> lock, work_list, and flags all reside in one cacheline, so it's pretty
>> likely that a single io_wqe_enqueue() would get the cacheline "stolen"
>> several times during one enqueue - without allowing any progress in the
>> worker, of course.
>
> Since it's provably harmful for this case, and the gain was small (but
> noticeable) in single-issue cases, I think we should just kill it. With
> the poll retry stuff for 5.7, there'll be even less of a need for it.
>
> Care to send a patch for 5.6 to kill it?
>
>> I also wonder if we can't avoid dequeuing entries one-by-one within the
>> worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
>> are just hitting the page cache, they're pretty fast, making it
>> plausible to cause pretty bad contention on the spinlock (even without
>> the spinning above). Whereas the submission side is at least somewhat
>> likely to be able to submit several queue entries while the worker is
>> processing one job, that's pretty unlikely for workers.
>>
>> In the hashed case there shouldn't be another worker processing entries
>> for the same hash. So it seems quite possible for the wqe to drain a few
>> of the entries for that hash within one spinlock acquisition, and then
>> process them one-by-one?
>
> Yeah, I think that'd be a good optimization for hashed work. Work N+1
> can't make any progress before work N is done anyway, so we might as well
> grab a batch at a time.

A problem here is that we effectively have a two-dimensional list of works
because of linked requests: each work can carry a dependent work that has to
run after it completes. For hashed requests we could io_wqe_enqueue() such
dependent works, delegating them to other threads, but if the work list is
not per-core that will hurt locality. Another option is to re-enqueue a
hashed work whenever it has a dependent work. I need to think about how to
do this better.

--
Pavel Begunkov
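For illustration, the batching idea discussed above - draining every queued
work that shares a hash under a single lock acquisition, then processing the
batch without the lock - can be sketched roughly as follows. This is a
minimal userspace sketch under simplifying assumptions, not io-wq code: the
names (work_item, wqe_like, dequeue_hashed_batch) and the plain mutex plus
singly linked list are stand-ins for wqe->lock and the real work list.

#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for a hashed work item; not the io-wq struct. */
struct work_item {
	struct work_item *next;
	unsigned int hash;		/* e.g. hash of the target file */
	void (*fn)(struct work_item *);
};

/* Simplified per-node queue guarded by one lock, like wqe->lock. */
struct wqe_like {
	pthread_mutex_t lock;
	struct work_item *head;
};

/*
 * Drain all consecutive entries sharing 'hash' under one lock acquisition,
 * instead of re-taking the lock for every item.  Returns the batch as a
 * singly linked list that the caller can process without holding the lock.
 */
static struct work_item *dequeue_hashed_batch(struct wqe_like *wqe,
					      unsigned int hash)
{
	struct work_item *batch = NULL, **tail = &batch;

	pthread_mutex_lock(&wqe->lock);
	while (wqe->head && wqe->head->hash == hash) {
		struct work_item *w = wqe->head;

		wqe->head = w->next;
		w->next = NULL;
		*tail = w;
		tail = &w->next;
	}
	pthread_mutex_unlock(&wqe->lock);

	return batch;
}

/* Worker side: grab a batch once, then run each work without the lock. */
static void process_hashed(struct wqe_like *wqe, unsigned int hash)
{
	struct work_item *w = dequeue_hashed_batch(wqe, hash);

	while (w) {
		struct work_item *next = w->next;

		w->fn(w);	/* same-hash works are serialized anyway */
		free(w);
		w = next;
	}
}

In the real io-wq such a batch would presumably also have to stop at a work
that carries a linked (dependent) request, which is exactly the complication
raised at the end of the mail.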