Re: [PATCH v6] io_uring: Statistics of the true utilization of sq threads.

On 12/30/23 22:17, Jens Axboe wrote:
On 12/30/23 2:06 PM, Pavel Begunkov wrote:
On 12/30/23 17:41, Jens Axboe wrote:
On 12/30/23 9:27 AM, Pavel Begunkov wrote:
On 12/26/23 16:32, Jens Axboe wrote:

On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
Count the running time and actual IO processing time of the sqpoll
thread, and output the statistical data to fdinfo.

Variable description:
"work_time" in the code represents the sum of the jiffies of the sq
thread actually processing IO, that is, how many milliseconds it
actually takes to process IO. "total_time" represents the total time
that the sq thread has elapsed from the beginning of the loop to the
current time point, that is, how many milliseconds it has spent in
total.
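For illustration, the two counters reduce to a single utilization figure in userspace. This is a hypothetical consumer sketch, not part of the patch; the division is unit-agnostic, so it works whether the fields end up in jiffies or in time units, as long as both use the same one:

```c
#include <assert.h>

/* Hypothetical consumer of the two fdinfo counters: given the
 * accumulated IO-processing time (work_time) and the total elapsed
 * time of the sqpoll loop (total_time), compute utilization in whole
 * percent. Units cancel out, so jiffies vs milliseconds does not
 * matter here as long as both fields share a unit. */
static unsigned int sq_util_percent(unsigned long long work_time,
                                    unsigned long long total_time)
{
	if (total_time == 0)
		return 0;
	return (unsigned int)(work_time * 100 / total_time);
}
```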

[...]

Applied, thanks!

[1/1] io_uring: Statistics of the true utilization of sq threads.
         commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655

I don't believe the patch is anywhere near complete, there are still
pending questions that the author ignored (see replies to
prev revisions).

We can drop and defer, that's not an issue. It's still sitting at the
top of the branch.

Can you elaborate on the pending questions?

I guess that wasn't clear, but I duplicated all of them in the
email you're replying to, for convenience:

Why it uses jiffies instead of some task run time?
Consequently, why it's fine to account irq time and other
preemption? (hint, it's not)

Yeah that's a good point, might be better to use task run time. Jiffies
is also an annoying metric to expose, as you'd need to then get the tick
rate as well. Though I suspect the ratio is the interesting bit here.
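The tick-rate problem is concrete: the conversion itself is trivial, but userspace can only portably get USER_HZ via sysconf(_SC_CLK_TCK), which is not guaranteed to match the kernel's CONFIG_HZ that jiffies actually tick at. A minimal sketch of the conversion, assuming the consumer somehow knows the right rate:

```c
/* Sketch of why exposing raw jiffies is awkward: converting to
 * milliseconds needs the tick rate (hz). sysconf(_SC_CLK_TCK) gives
 * USER_HZ, which need not equal the kernel's CONFIG_HZ, so a caller
 * has no portable way to learn the correct hz for raw jiffies --
 * one reason a time-based or ratio-based export is nicer. */
static unsigned long long jiffies_to_ms(unsigned long long j, long hz)
{
	return j * 1000ULL / (unsigned long long)hz;
}
```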

I agree that seconds are nicer, but that's not my point. That's
not about jiffies, but that the patch keeps counting regardless
whether the SQ task was actually running, or the CPU was serving
irq, or even if it was force descheduled.

Right, guess I wasn't clear, but I do very much agree with using task
run time to avoid cases like that, where the thread is perceived as
running but really isn't.

I even outlined what a solution may look like, i.e. replace jiffies
with task runtime, which should already be counted in the task.

Would be a good change to make. And to be fair, I guess they originally
wanted something like that, as the very first patch had some scheduler
interactions. Just wasn't done quite right.

Right, just like what the v1 was doing but without touching
core/sched.

Why it can't be done with userspace and/or bpf? Why
can't it be estimated by checking and tracking
IORING_SQ_NEED_WAKEUP in userspace?

Asking people to integrate bpf for this is a bit silly imho. Tracking

I haven't seen any mention of the real use case, did I miss it?
Because otherwise I fail to see how it can possibly be called
silly when it's not clear how exactly it's used.

Maybe it's a bash program printing stats to a curious user? Or
maybe it's to track once at start, and then nobody cares about
it, in which case NEED_WAKEUP would be justified.

I can guess it's for adjusting the sq timeouts, but who knows.

I only know what is in those threads, but the most obvious use case
would indeed be to vet the efficiency of the chosen timeout value and
balance cpu usage with latency like that.

NEED_WAKEUP is also quite cumbersome and would most likely be higher
overhead as well.

Comparing to reading a procfs file or doing an io_uring
register syscall? I doubt that. Also, not everyone
would be using that.

What's the proposed integration to make NEED_WAKEUP sampling work? As
far as I can tell, you'd need to either do that kind of accounting every
time you do io_uring_submit(), or make it conditional which would then
at least still have a branch.

The kernel side would obviously not be free either, but at least it
would be restricted to the SQPOLL side of things and not need to get
entangled with the general IO path that doesn't use SQPOLL.

If we put it in there and have some way to enable/query/disable, then it
least it would just be a branch or two in there rather than in the
generic path.

It can be in the app without ever touching liburing/the kernel. E.g. you
can binary search the idle time, minimising it, while checking that
sqpoll doesn't go to sleep too often, topped off with other conditions.
Another stat you can get right away is to compare the current idle
time with the total time since the last wake up (which should be idle +
extra).
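The sampling idea sketches out entirely in userspace: periodically read the SQ ring's flags word and count how often the kernel has marked the sqpoll thread as needing a wakeup. The flag value matches the uapi definition, but the estimator itself and the sampled array are hypothetical; a real app would read `*ring->sq.kflags` (liburing) on each sample:

```c
#include <stdint.h>

/* From <linux/io_uring.h> (redefined here so the sketch is
 * self-contained): set by the kernel when the sqpoll thread has gone
 * idle and needs io_uring_enter() to wake it. */
#define IORING_SQ_NEED_WAKEUP	(1U << 0)

/* Hypothetical estimator: given n periodic samples of the ring's
 * flags word, return the percentage of samples in which sqpoll was
 * asleep -- a rough stand-in for the "idle vs busy" split the patch
 * exports from the kernel side. */
static unsigned int sq_sleep_percent(const uint32_t *samples, unsigned int n)
{
	unsigned int asleep = 0, i;

	if (n == 0)
		return 0;
	for (i = 0; i < n; i++)
		if (samples[i] & IORING_SQ_NEED_WAKEUP)
			asleep++;
	return asleep * 100 / n;
}
```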

It might be enough for _some_ apps, but due to the fog of war
I assume there are currently 0 apps that are supposed to use it
anyway.

What's the use case in particular? Considering that
one of the previous revisions was uapi-less, something
is really fishy here. Again, it's a procfs file nobody
but a few would want to parse to use the feature.

I brought this up earlier too, fdinfo is not a great API. For anything,
really.

I saw that comment, that's why I mentioned it, but the
point is that I have doubts the author is even using
the uapi.

Not sure I follow... If they aren't using the API, what's the point of
the patch? Or are you questioning whether this is being done for an
actual use case, or just as a "why not, might be handy" kind of thing?

I assume there is a use case, which hasn't been spelled out
though AFAIK, but as a matter of fact earlier patches had no uapi.

https://lore.kernel.org/all/20231106074055.1248629-1-xiaobing.li@xxxxxxxxxxx/

and the later patches use a suspiciously inconvenient interface,
i.e. /proc. It's a good question how it was supposed to be used
(and tested), but I have only guesses. Private patches? A custom
module? Or maybe a genuine mistake, which is why I'm still very
curious to hear about the application.

--
Pavel Begunkov



