On 3/9/20 2:03 PM, Pavel Begunkov wrote:
> On 24/02/2020 18:22, Jens Axboe wrote:
>> On 2/24/20 2:35 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>>>> [description of buffered write workloads being slower via io_uring
>>>> than plain writes]
>>>> Because I'm working on other items, I didn't read carefully enough.
>>>> Yes, this won't change the situation for writes. I'll take a look at
>>>> this when I get time, maybe there's something we can do to improve
>>>> the situation.
>>>
>>> I looked a bit into this.
>>>
>>> I think one issue is the spinning the workers do:
>>>
>>> static int io_wqe_worker(void *data)
>>> {
>>> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>>> 		set_current_state(TASK_INTERRUPTIBLE);
>>> loop:
>>> 		if (did_work)
>>> 			io_worker_spin_for_work(wqe);
>>> 		spin_lock_irq(&wqe->lock);
>>> 		if (io_wqe_run_queue(wqe)) {
>>>
>>> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
>>> {
>>> 	int i = 0;
>>>
>>> 	while (++i < 1000) {
>>> 		if (io_wqe_run_queue(wqe))
>>> 			break;
>>> 		if (need_resched())
>>> 			break;
>>> 		cpu_relax();
>>> 	}
>>> }
>>>
>>> Even with the cpu_relax(), that causes quite a lot of cross-socket
>>> traffic, slowing down the submission side, which after all frequently
>>> needs to take the wqe->lock just to be able to submit a queue entry.
>>>
>>> lock, work_list and flags all reside in one cacheline, so it's pretty
>>> likely that a single io_wqe_enqueue() would get the cacheline "stolen"
>>> several times during one enqueue - without allowing any progress in
>>> the worker, of course.
>>
>> Since it's provably harmful for this case, and the gain was small (but
>> noticeable) on single issue cases, I think we should just kill it. With
>> the poll retry stuff for 5.7, there'll be even less of a need for it.
>>
>> Care to send a patch for 5.6 to kill it?
>>
>>> I also wonder if we can't avoid dequeuing entries one-by-one within
>>> the worker, at least for the IO_WQ_WORK_HASHED case. Especially when
>>> writes are just hitting the page cache, they're pretty fast, making it
>>> plausible to cause pretty bad contention on the spinlock (even without
>>> the spinning above). Whereas the submission side is at least somewhat
>>> likely to be able to submit several queue entries while the worker is
>>> processing one job; that's pretty unlikely for workers.
>>>
>>> In the hashed case there shouldn't be another worker processing
>>> entries for the same hash. So it seems quite possible for the wqe to
>>> drain a few of the entries for that hash within one spinlock
>>> acquisition, and then process them one-by-one?
>>
>> Yeah, I think that'd be a good optimization for hashed work. Work N+1
>> can't make any progress before work N is done anyway, so might as well
>> grab a batch at a time.
>
> A problem here is that we actually have a 2D array of works because of
> linked requests.

You could either skip anything with a link, or even just ignore the link
when grabbing a batch and simply re-queue the dependent work when its
head completes, if that dependent work isn't hashed.

> We can io_wqe_enqueue() dependent works, if we have hashed requests,
> delegating them to other threads. But if the work->list is not
> per-core, it will hurt locality. Either that, or re-enqueue hashed ones
> if there is a dependent work. Need to think about how to do this
> better.

If we ignore links for a second, I think we can all agree that it'd be a
big win to do the batch. With links, the worst case would then be
something where every other link is hashed.
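Something like the below is what I have in mind for the grab side.
Totally untested, just to illustrate the idea - io_wqe_grab_hashed_batch()
is a made-up helper, and the list/field usage follows my reading of the
current io-wq code, so treat it as a sketch rather than the real thing:

/*
 * Sketch: drain up to 'max' pending works that share 'hash' while
 * taking wqe->lock once, instead of re-acquiring it per work item.
 * Assumes the calling worker already owns the hash bit in
 * wqe->hash_map, so no other worker can run works for this hash.
 */
static unsigned io_wqe_grab_hashed_batch(struct io_wqe *wqe, unsigned hash,
					 struct io_wq_work **batch,
					 unsigned max)
{
	struct io_wq_work_node *node, *prev = NULL;
	unsigned nr = 0;

	spin_lock_irq(&wqe->lock);
	node = wqe->work_list.first;
	while (node && nr < max) {
		struct io_wq_work_node *next = node->next;
		struct io_wq_work *work;

		work = container_of(node, struct io_wq_work, list);
		if ((work->flags & IO_WQ_WORK_HASHED) &&
		    (work->flags >> IO_WQ_HASH_SHIFT) == hash) {
			/* unlink; prev stays put since node is gone */
			wq_node_del(&wqe->work_list, node, prev);
			batch[nr++] = work;
		} else {
			prev = node;
		}
		node = next;
	}
	spin_unlock_irq(&wqe->lock);

	/* caller runs batch[0..nr) back to back, no lock ping-pong */
	return nr;
}

The worker would call this right after finishing a hashed work, while it
still owns the hash, and only return to the normal dequeue path once the
batch comes back empty.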
For a first patch, I'd be quite happy to just stop the batch if there's a
link on a request. The normal case here is buffered writes, and that'll
handle that case perfectly. Links will be no worse than before. Seems
like a no-brainer to me.
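That first cut would literally be one more bail-out in the grab loop
sketched above - with the caveat that io-wq itself doesn't know whether
a work carries a link, so IO_WQ_WORK_HAS_LINK is a hypothetical flag
that io_uring would have to set when punting a linked request:

		if ((work->flags & IO_WQ_WORK_HASHED) &&
		    (work->flags >> IO_WQ_HASH_SHIFT) == hash) {
			/*
			 * A linked request ends the batch, so dependent
			 * works keep the old one-at-a-time ordering.
			 * IO_WQ_WORK_HAS_LINK is hypothetical.
			 */
			if (work->flags & IO_WQ_WORK_HAS_LINK)
				break;
			wq_node_del(&wqe->work_list, node, prev);
			batch[nr++] = work;
		}

-- 
Jens Axboe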