On 03/06/2020 17:47, Pavel Begunkov wrote:
> On 26/05/2020 17:42, Xiaoguang Wang wrote:
>> Yes, I don't try to make all io_uring instances in the system share threads, I just
>> make io_uring instances which are bound to the same cpu core share one io_sq_thread
>> that is created only once for every cpu core.
>> Otherwise, in current io_uring mainline code, we'd better not bind different io_uring
>> instances to the same cpu core: some instances' busy loop in their sq_thread_idle
>> period will impact other instances which currently have reqs to handle.
>
> I got a bit carried away from your initial case, but there is the case:
> Let's say we have 2 unrelated apps that create SQPOLL io_uring instances. Let's say
> they are in 2 different docker containers for the sake of argument (and let's just
> assume a docker container user can create such).
>
> The first, app1, submits 32K fat requests as described before. The second one (app2)
> is a low-latency app that submits reqs one at a time, but expects each to be picked up
> really fast. And let's assume their SQPOLL threads are pinned to the same CPU.
>
> 1. old version:
> The CPU spends some time allocated by the scheduler on the 32K requests of app1,
> probably not issuing them all, but that's fine. And then it goes to app2.
> So, the submit-to-pickup latency for app2 is capped by the task scheduler.
> That's somewhat fair.
>
> 2. your version:
> io_sq_thread first processes all 32K requests of app1, and only then goes to app2.
> app2 is screwed, unfair as life can be. And a malicious user can create many io_uring
> instances like app1's. So the latency will be further multiplied.
>
>
> Any solution I can think of is ugly and won't ever land upstream. Like creating your
> own scheduling framework for io_uring, wiring in kind-of cgroups, etc. And actually
> SQPOLL shouldn't be so ubiquitous (+ it needs privileges). E.g. I expect there will be
> a single app per system using it, e.g. a database consuming most of the resources anyway.
> And that's why I think it's better not to try to solve your original issue.
>
>
> However, what the patch can easily be remade into is sharing an SQPOLL thread between
> io_uring instances of a single app/user, like passing an fd as described before.
> The most obvious example is to share 1 SQPOLL thread (or N << num_cpus) between all
> user threads, so
> - still creating an io_uring per thread to not synchronise the SQ
> - retaining CPU time for real user work (instead of having N SQPOLL threads)
> - increasing polling efficiency (more work -- less idle polling)
> - and scheduling/task migration, etc.
>

Just to add a thing, basically this doesn't differ much from having 1 io_uring per
bunch of threads, but replacing SQ synchronisation with round-robin polling. If
going this way, it'd need a thorough evaluation of performance benefits (if any).

>
> note: would be great to check that it has all the necessary cond_resched()
>

--
Pavel Begunkov
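
To put rough numbers on the app1/app2 scenario above, here is a toy Python model. It is not kernel code and not what the patch implements: the function name `requests_ahead_of` and the `budget` parameter are made up for illustration. It counts how many foreign requests a single shared poller issues before it first reaches a given ring. It compares draining each ring fully (case 2 above) with a hypothetical per-visit cap in the spirit of the round-robin polling mentioned in the follow-up.

```python
def requests_ahead_of(pending, target, budget=None):
    """Count requests issued from other rings before `target` is first visited.

    pending: pending-request count per ring, in the order the poller visits them.
    target:  index of the ring whose pickup delay we want.
    budget:  hypothetical cap on requests taken from one ring per visit;
             None models "drain the ring completely before moving on".
    """
    if pending[target] == 0:
        raise ValueError("target ring has nothing queued")
    ahead = 0
    for count in pending[:target]:
        # Either drain the ring, or take at most `budget` requests from it.
        ahead += count if budget is None else min(count, budget)
    return ahead


# app1 has 32K requests queued, app2 has a single one.
print(requests_ahead_of([32768, 1], target=1))                 # 32768
print(requests_ahead_of([32768, 1], target=1, budget=8))       # 8

# Four app1-like rings ahead of app2: the delay multiplies.
print(requests_ahead_of([32768] * 4 + [1], target=4))            # 131072
print(requests_ahead_of([32768] * 4 + [1], target=4, budget=8))  # 32
```

Even with a cap, the delay still grows with the number of rings sharing the poller, so a cap alone does not address the malicious-user case. The model also ignores per-request cost and the idle period entirely.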