On 26/05/2020 17:42, Xiaoguang Wang wrote:
> Yes, I don't try to make all io_uring instances in the system share threads, I just
> make io_uring instances which are bound to the same cpu core share one io_sq_thread
> that is created only once for every cpu core.
> Otherwise, in the current io_uring mainline code, we'd better not bind different
> io_uring instances to the same cpu core: one instance's busy loop during its
> sq_thread_idle period will impact other instances which currently have reqs to handle.

I got a bit carried away from your initial case, but there is this case:

Let's say we have 2 unrelated apps that create SQPOLL io_uring instances. Say, for the
sake of argument, they are in 2 different docker containers (and let's just assume a
docker container user can create such). The first one (app1) submits 32K fat requests
as described before. The second one (app2) is a low-latency app; it submits reqs one
by one, but expects them to be picked up really fast. And let's assume their SQPOLL
threads are pinned to the same CPU.

1. old version: The CPU spends some time, allocated by the task scheduler, on the 32K
requests of app1, probably not issuing them all, but that's fine, and then it goes to
app2. So the submit-to-pickup latency for app2 is capped by the task scheduler. That's
somewhat fair.

2. your version: io_sq_thread first processes all 32K requests of app1, and only then
goes to app2. app2 is screwed, unfair as life can be. And a malicious user can create
many io_uring instances as in app1, so the latency will be further multiplied.

Any solution I can think of is ugly and won't ever land upstream, like creating your
own scheduling framework for io_uring, wiring in some kind of cgroups support, etc.
And actually SQPOLL shouldn't be so ubiquitous (+ it needs privileges). E.g. I expect
there will be a single app per system using it, e.g. a database consuming most of the
resources anyway. And that's why I think it's better not to try to solve your
original issue.
However, what the patch can easily be remade into is sharing an SQPOLL thread between
io_uring instances of a single app/user, e.g. by passing an fd as described before.
The most obvious example is to share 1 SQPOLL thread (or N << num_cpus) between all
user threads, so:
- still creating an io_uring per thread to avoid synchronising the SQ
- retaining CPU time for real user work (instead of having N SQPOLL threads)
- increasing polling efficiency (more work -- less idle polling)
- and improving scheduling/task migration, etc.

note: it would be great to check that it has all the necessary cond_resched() calls

>
>> specified set of them. In other words, making your @percpu_threads not global,
>> but rather binding them to a set of io_urings.
>>
>> e.g. create 2 independent per-cpu sets of threads. 1st one for {a1,a2,a3}, 2nd
>> for {b1,b2,b3}.
>>
>> a1 = create_uring()
>> a2 = create_uring(shared_sq_threads=a1)
>> a3 = create_uring(shared_sq_threads=a1)
>>
>> b1 = create_uring()
>> b2 = create_uring(shared_sq_threads=b1)
>> b3 = create_uring(shared_sq_threads=b1)
> Thanks for your suggestions, I'll try to consider them in the V2 patch.
>
> But I still have one question: now "shared_sq_threads=a1" and "shared_sq_threads=b1"
> can be bound to the same cpu core? I think that's still not efficient. E.g.
> "shared_sq_threads=a1" busy looping in its sq_thread_idle period will impact
> "shared_sq_threads=b1", which currently has reqs to handle.
>
> Now I also wonder whether an io_uring instance which has SQPOLL enabled and does
> not specify a sq_thread_cpu can be used in a real business environment: the
> io_sq_thread may run on any cpu core, which may affect any other application,
> e.g. through cpu resource contention. So if trying to use SQPOLL, we'd better
> allocate a specific cpu.

I think it's ok:
- it needs privileges to create SQPOLL
- I don't expect it to be used without thinking twice
- for the not-pinned case everything will be rescheduled to other cpus as appropriate

It shouldn't affect much in normal circumstances.

--
Pavel Begunkov