On 3/9/20 2:03 PM, Pavel Begunkov wrote:
> On 24/02/2020 18:22, Jens Axboe wrote:
>> On 2/24/20 2:35 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>>>> [description of buffered write workloads being slower via io_uring
>>>> than plain writes]
>>>> Because I'm working on other items, I didn't read carefully enough.
>>>> Yes, this won't change the situation for writes. I'll take a look at
>>>> this when I get time, maybe there's something we can do to improve
>>>> the situation.
>>>
>>> I looked a bit into this.
>>>
>>> I think one issue is the spinning the workers do:
>>>
>>> static int io_wqe_worker(void *data)
>>> {
>>> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>>> 		set_current_state(TASK_INTERRUPTIBLE);
>>> loop:
>>> 		if (did_work)
>>> 			io_worker_spin_for_work(wqe);
>>> 		spin_lock_irq(&wqe->lock);
>>> 		if (io_wqe_run_queue(wqe)) {
>>>
>>> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
>>> {
>>> 	int i = 0;
>>>
>>> 	while (++i < 1000) {
>>> 		if (io_wqe_run_queue(wqe))
>>> 			break;
>>> 		if (need_resched())
>>> 			break;
>>> 		cpu_relax();
>>> 	}
>>> }
>>>
>>> Even with the cpu_relax(), that causes quite a lot of cross-socket
>>> traffic, slowing down the submission side, which after all frequently
>>> needs to take the wqe->lock just to be able to submit a queue entry.
>>>
>>> lock, work_list and flags all reside in one cacheline, so it's pretty
>>> likely that a single io_wqe_enqueue() would get the cacheline "stolen"
>>> several times during one enqueue - without allowing any progress in
>>> the worker, of course.
>>
>> Since it's provably harmful for this case, and the gain was small (but
>> noticeable) on single issue cases, I think we should just kill it. With
>> the poll retry stuff for 5.7, there'll be even less of a need for it.
>>
>> Care to send a patch for 5.6 to kill it?
>>
>>> I also wonder if we can't avoid dequeuing entries one-by-one within
>>> the worker, at least for the IO_WQ_WORK_HASHED case. Especially when
>>> writes are just hitting the page cache, they're pretty fast, making it
>>> plausible to cause pretty bad contention on the spinlock (even without
>>> the spinning above). Whereas the submission side is at least somewhat
>>> likely to be able to submit several queue entries while the worker is
>>> processing one job; that's pretty unlikely for workers.
>>>
>>> In the hashed case there shouldn't be another worker processing
>>> entries for the same hash. So it seems quite possible for the wqe to
>>> drain a few of the entries for that hash within one spinlock
>>> acquisition, and then process them one-by-one?
>>
>> Yeah, I think that'd be a good optimization for hashed work. Work N+1
>> can't make any progress before work N is done anyway, so might as well
>> grab a batch at a time.
>
> A problem here is that we actually have a 2D array of works because of
> linked requests.

You could either skip anything with a link, or even just ignore the link
when grabbing a batch and simply re-queue the dependent work when its
head completes, if that dependent work isn't hashed.

> We can io_wqe_enqueue() dependent works, if we have hashed requests,
> delegating them to other threads. But if the work->list is not
> per-core, it will hurt locality. Either that, or re-enqueue hashed ones
> if there is a dependent work. Need to think about how to do this
> better.

If we ignore links for a second, I think we can all agree that it'd be a
big win to do the batch. With links, the worst case would then be
something where every other link is hashed.
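Something like the below is what I have in mind for the grab side.
Totally untested, just to illustrate the idea - io_wqe_grab_hashed_batch()
is a made-up helper, and the list/field usage follows my reading of the
current io-wq code, so treat it as a sketch rather than the real thing:

/*
 * Sketch: drain up to 'max' pending works that share 'hash' while
 * taking wqe->lock once, instead of re-acquiring it per work item.
 * Assumes the calling worker already owns the hash bit in
 * wqe->hash_map, so no other worker can run works for this hash.
 */
static unsigned io_wqe_grab_hashed_batch(struct io_wqe *wqe, unsigned hash,
					 struct io_wq_work **batch,
					 unsigned max)
{
	struct io_wq_work_node *node, *prev = NULL;
	unsigned nr = 0;

	spin_lock_irq(&wqe->lock);
	node = wqe->work_list.first;
	while (node && nr < max) {
		struct io_wq_work_node *next = node->next;
		struct io_wq_work *work;

		work = container_of(node, struct io_wq_work, list);
		if ((work->flags & IO_WQ_WORK_HASHED) &&
		    (work->flags >> IO_WQ_HASH_SHIFT) == hash) {
			/* unlink; prev stays put since node is gone */
			wq_node_del(&wqe->work_list, node, prev);
			batch[nr++] = work;
		} else {
			prev = node;
		}
		node = next;
	}
	spin_unlock_irq(&wqe->lock);

	/* caller runs batch[0..nr) back to back, no lock ping-pong */
	return nr;
}

The worker would call this right after finishing a hashed work, while it
still owns the hash, and only return to the normal dequeue path once the
batch comes back empty.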
For a first patch, I'd be quite happy to just stop the batch if there's a
link on a request. The normal case here is buffered writes, and that'll
handle that case perfectly. Links will be no worse than before. Seems
like a no-brainer to me.
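That first cut would literally be one more bail-out in the grab loop
sketched above - with the caveat that io-wq itself doesn't know whether
a work carries a link, so IO_WQ_WORK_HAS_LINK is a hypothetical flag
that io_uring would have to set when punting a linked request:

		if ((work->flags & IO_WQ_WORK_HASHED) &&
		    (work->flags >> IO_WQ_HASH_SHIFT) == hash) {
			/*
			 * A linked request ends the batch, so dependent
			 * works keep the old one-at-a-time ordering.
			 * IO_WQ_WORK_HAS_LINK is hypothetical.
			 */
			if (work->flags & IO_WQ_WORK_HAS_LINK)
				break;
			wq_node_del(&wqe->work_list, node, prev);
			batch[nr++] = work;
		}

-- 
Jens Axboe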