On 26/05/2020 17:42, Xiaoguang Wang wrote:
> Yes, I don't try to make all io_uring instances in the system share threads, I just
> make io_uring instances which are bound to the same cpu core share one io_sq_thread
> that is created only once for every cpu core.
> Otherwise, in the current io_uring mainline code, we'd better not bind different
> io_uring instances to the same cpu core: one instance's busy loop during its
> sq_thread_idle period will impact other instances which currently have reqs to handle.

I got a bit carried away from your initial case, but there is this case:

Let's say we have 2 unrelated apps that create SQPOLL io_uring instances. Say, for the
sake of argument, they are in 2 different docker containers (and let's just assume a
docker container user can create such). The first one (app1) submits 32K fat requests
as described before. The second one (app2) is a low-latency app; it submits reqs one
by one, but expects them to be picked up really fast. And let's assume their SQPOLL
threads are pinned to the same CPU.

1. old version: The CPU spends some time, allocated by the task scheduler, on the 32K
requests of app1, probably not issuing them all, but that's fine, and then it goes to
app2. So the submit-to-pickup latency for app2 is capped by the task scheduler. That's
somewhat fair.

2. your version: io_sq_thread first processes all 32K requests of app1, and only then
goes to app2. app2 is screwed, unfair as life can be. And a malicious user can create
many io_uring instances as in app1, so the latency will be further multiplied.

Any solution I can think of is ugly and won't ever land upstream, like creating your
own scheduling framework for io_uring, wiring in some kind of cgroups support, etc.
And actually SQPOLL shouldn't be so ubiquitous (+ it needs privileges). E.g. I expect
there will be a single app per system using it, e.g. a database consuming most of the
resources anyway. And that's why I think it's better not to try to solve your
original issue.
However, what the patch can easily be remade into is sharing an SQPOLL thread between
io_uring instances of a single app/user, e.g. by passing an fd as described before.
The most obvious example is to share 1 SQPOLL thread (or N << num_cpus) between all
user threads, so:
- still creating an io_uring per thread to avoid synchronising the SQ
- retaining CPU time for real user work (instead of having N SQPOLL threads)
- increasing polling efficiency (more work -- less idle polling)
- and improving scheduling/task migration, etc.

note: it would be great to check that it has all the necessary cond_resched() calls

>
>> specified set of them. In other words, making your @percpu_threads not global,
>> but rather binding them to a set of io_urings.
>>
>> e.g. create 2 independent per-cpu sets of threads. 1st one for {a1,a2,a3}, 2nd
>> for {b1,b2,b3}.
>>
>> a1 = create_uring()
>> a2 = create_uring(shared_sq_threads=a1)
>> a3 = create_uring(shared_sq_threads=a1)
>>
>> b1 = create_uring()
>> b2 = create_uring(shared_sq_threads=b1)
>> b3 = create_uring(shared_sq_threads=b1)
> Thanks for your suggestions, I'll try to consider them in the V2 patch.
>
> But I still have one question: now "shared_sq_threads=a1" and "shared_sq_threads=b1"
> can be bound to the same cpu core? I think that's still not efficient. E.g.
> "shared_sq_threads=a1" busy looping in its sq_thread_idle period will impact
> "shared_sq_threads=b1", which currently has reqs to handle.
>
> Now I also wonder whether an io_uring instance which has SQPOLL enabled and does
> not specify a sq_thread_cpu can be used in a real business environment: the
> io_sq_thread may run on any cpu core, which may affect any other application,
> e.g. through cpu resource contention. So if trying to use SQPOLL, we'd better
> allocate a specific cpu.

I think it's ok:
- it needs privileges to create SQPOLL
- I don't expect it to be used without thinking twice
- for the not-pinned case everything will be rescheduled to other cpus as appropriate

It shouldn't affect much in normal circumstances.

--
Pavel Begunkov