Re: [PATCH] io_uring: create percpu io sq thread when IORING_SETUP_SQ_AFF is flagged

Pavel Begunkov <asml.silence@xxxxxxxxx> · Wed, 3 Jun 2020 21:48:53 +0300

On 03/06/2020 17:47, Pavel Begunkov wrote:
> On 26/05/2020 17:42, Xiaoguang Wang wrote:
>> Yes, I don't try to make all io_uring instances in the system share threads, I just
>> make io_uring instances which are bound to same cpu core, share one io_sq_thread that
>> only is created once for every cpu core.
>> Otherwise in current io_uring mainline codes, we'd better not bind different io_uring
>> instances to same cpu core,  some instances' busy loop in its sq_thread_idle period will
>> impact other instanes who currently there are reqs to handle.
> 
> I got a bit carried away from your initial case, but there is the case:
> Let's we have 2 unrelated apps that create SQPOLL io_uring instances. Let's say they are
> in 2 different docker container for the argument (and let's just assume a docker container
> user can create such).
> 
> The first app1 submits 32K fat requests as described before. The second one (app2) is a
> low-latency app, submits reqs by 1, but expects it to be picked really fast. And let's
> assume their SQPOLL threads pinned to the same CPU.
> 
> 1. old version:
> The CPU spends some time allocated by a scheduler on 32K requests of app1,
> probably not issuing them all but that's fine. And then it goes to the app2.
> So, the submit-to-pickup latency for app2 is capped by a task scheduler.
> That's somewhat fair. 
> 
> 2. your version:
> io_sq_thread first processes all 32K of requests of app1, and only then goes to app2.
> app2 is screwed, unfair as life can be. And a malicious user can create many io_uring
> instances as in app1. So the latency will be further multiplied.
> 
> 
> Any solution I can think of is ugly and won't ever land upstream. Like creating your
> own scheduling framework for io_uring, wiring kindof cgroups, etc. And actually SQPOLL
> shouldn't be so ubiquitous (+needs privileges). E.g. I expect there will be a single
> app per system using it, e.g. a database consuming most of the resources anyway.
> And that's why I think it's better to not trying to solve your original issue.
> 
> 
> However, what the patch can be easily remade into is sharing an SQPOLL thread between
> io_uring instances of a single app/user, like passing fd described before.
> The most obvious example is to share 1 SQPOLL thread (or N << num_cpus) between all
> user threads, so
> - still creating io_uring per thread to not synchronise SQ
> - retaining CPU time for real user work (instead of having N SQPOLL threads)
> - increasing polling efficiency (more work -- less idle polling)
> - and scheduling/task migration, etc.
> 

Just to add a thing, basically this doesn't differ much from having 1 io_uring
per bunch of threads but replacing SQ synchronisation with the round-robin polling.
If going this way, it'd need a thorough evaluation of performance benefits (if any).

> 
> note: would be great to check, that it has all necessary cond_resched()
> 

-- 
Pavel Begunkov