On 12/30/23 4:17 PM, Pavel Begunkov wrote:
> On 12/30/23 22:17, Jens Axboe wrote:
>> On 12/30/23 2:06 PM, Pavel Begunkov wrote:
>>> On 12/30/23 17:41, Jens Axboe wrote:
>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>> On 12/26/23 16:32, Jens Axboe wrote:
>>>>>>
>>>>>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>>>>>> Count the running time and actual IO processing time of the sqpoll
>>>>>>> thread, and output the statistical data to fdinfo.
>>>>>>>
>>>>>>> Variable description:
>>>>>>> "work_time" in the code represents the sum of the jiffies of the sq
>>>>>>> thread actually processing IO, that is, how many milliseconds it
>>>>>>> actually takes to process IO. "total_time" represents the total time
>>>>>>> that the sq thread has elapsed from the beginning of the loop to the
>>>>>>> current time point, that is, how many milliseconds it has spent in
>>>>>>> total.
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>> Applied, thanks!
>>>>>>
>>>>>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>>>>>       commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
>>>>>
>>>>> I don't believe the patch is near complete, there are still
>>>>> pending question that the author ignored (see replies to
>>>>> prev revisions).
>>>>
>>>> We can drop and defer, that's not an issue. It's still sitting top of
>>>> branch.
>>>>
>>>> Can you elaborate on the pending questions?
>>>
>>> I guess that wasn't clear, but I duplicated all of them in the
>>> email you're replying to for convenience
>>>
>>>>> Why it uses jiffies instead of some task run time?
>>>>> Consequently, why it's fine to account irq time and other
>>>>> preemption? (hint, it's not)
>>>>
>>>> Yeah that's a good point, might be better to use task run time. Jiffies
>>>> is also an annoying metric to expose, as you'd need to then get the tick
>>>> rate as well. Though I suspect the ratio is the interesting bit here.
>>>
>>> I agree that seconds are nicer, but that's not my point. That's
>>> not about jiffies, but that the patch keeps counting regardless
>>> whether the SQ task was actually running, or the CPU was serving
>>> irq, or even if it was force descheduled.
>>
>> Right, guess I wasn't clear, I did very much agree with using task run
>> time to avoid cases like that where it's perceived running, but really
>> isn't. For example.
>>
>>> I even outlined what a solution may look like, i.e. replace jiffies
>>> with task runtime, which should already be counted in the task.
>>
>> Would be a good change to make. And to be fair, I guess they originally
>> wanted something like that, as the very first patch had some scheduler
>> interactions. Just wasn't done quite right.
>
> Right, just like what the v1 was doing but without touching
> core/sched.

Yep

>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>> can't it be estimated by checking and tracking
>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>
>>>> Asking people to integrate bpf for this is a bit silly imho. Tracking
>>>
>>> I haven't seen any mention of the real use case, did I miss it?
>>> Because otherwise I fail to see how it can possibly be called
>>> silly when it's not clear how exactly it's used.
>>>
>>> Maybe it's a bash program printing stats to a curious user? Or
>>> maybe it's to track once at start, and then nobody cares about
>>> it, in which case NEED_WAKEUP would be justified.
>>>
>>> I can guess it's for adjusting the sq timeouts, but who knows.
>>
>> I only know what is in those threads, but the most obvious use case
>> would indeed be to vet the efficiency of the chosen timeout value and
>> balance cpu usage with latency like that.
>>
>>>> NEED_WAKEUP is also quite cumbersome and would most likely be higher
>>>> overhead as well.
>>>
>>> Comparing to reading a procfs file or doing an io_uring
>>> register syscall? I doubt that. It's also not everyone
>>> would be using that.
>>
>> What's the proposed integration to make NEED_WAKEUP sampling work? As
>> far as I can tell, you'd need to either do that kind of accounting every
>> time you do io_uring_submit(), or make it conditional which would then
>> at least still have a branch.
>>
>> The kernel side would obviously not be free either, but at least it
>> would be restricted to the SQPOLL side of things and not need to get
>> entangled with the general IO path that doesn't use SQPOLL.
>>
>> If we put it in there and have some way to enable/query/disable, then it
>> least it would just be a branch or two in there rather than in the
>> generic path.
>
> It can be in the app without ever touching liburing/kernel. E.g. you
> can binary search the idle time, minimising it, while looking that
> sqpoll doesn't go to sleep too often topping with other conditions.
> Another stat you can get right away is to compare the current idle
> time with the total time since last wake up (which should be idle +
> extra).

Sure, it's more a question of whether you want a simple-to-query metric,
or build something a lot more involved to do it. Guess it depends on the
use case. I think exposing a metric makes sense, as it'll provide an easy
way for an app to gauge its efficiency and whether it'd make sense to
tweak settings. But I guess then what comes next is a way to modify the
timeout setting at runtime...

>>>>> What's the use case in particular? Considering that
>>>>> one of the previous revisions was uapi-less, something
>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>> but a few would want to parse to use the feature.
>>>>
>>>> I brought this up earlier too, fdinfo is not a great API. For anything,
>>>> really.
>>>
>>> I saw that comment, that's why I mentioned, but the
>>> point is that I have doubts the author is even using
>>> the uapi.
>>
>> Not sure I follow... If they aren't using the API, what's the point of
>> the patch? Or are you questioning whether this is being done for an
>> actual use case, or just as a "why not, might be handy" kind of thing?
>
> I assume there is a use case, which hasn't been spelled out
> though AFAIK, but as a matter of fact earlier patches had no uapi.
>
> https://lore.kernel.org/all/20231106074055.1248629-1-xiaobing.li@xxxxxxxxxxx/

Might be because they already have some patches exposing these metrics.
But ->

> and later patches use a suspiciously inconvenient interface,
> i.e. /proc. It's a good question how it was supposed to be used
> (and tested), but I have only guesses. Private patches? A custom
> module? Or maybe a genuine mistake, which is why I'm still very
> curious hearing about the application.

Agree, Xiaobing, can you expand on the actual use case here? How are you
intending to use this metric?

-- 
Jens Axboe
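
As a rough illustration of the "simple to query metric" side of this
discussion, a userspace consumer of the fdinfo stats could look something
like the sketch below. The field names ("SqWorkTime", "SqTotalTime") and
their units are assumptions, not something confirmed in this thread; check
the actual fdinfo output of a kernel carrying the patch before relying on
them.

/*
 * Hypothetical consumer of the proposed fdinfo stats: compute the
 * SQPOLL utilization ratio for a ring fd.  Field names and units are
 * assumed for illustration only.
 */
#include <stdio.h>

static int sq_utilization(int ring_fd, double *util)
{
	char path[64], line[256];
	unsigned long long work = 0, total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", ring_fd);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Assumed field names, see note above */
		sscanf(line, "SqWorkTime: %llu", &work);
		sscanf(line, "SqTotalTime: %llu", &total);
	}
	fclose(f);

	if (!total)
		return -1;
	*util = (double)work / (double)total;
	return 0;
}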
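
And for the userspace-only alternative Pavel outlines, a minimal sampling
sketch might look like the following; it assumes a liburing-managed SQPOLL
ring, reads the shared SQ ring flags word directly, and the counters plus
the idea of sampling at submit time are made up for illustration.

#include <liburing.h>

struct sqpoll_stats {
	unsigned long submits;	/* submissions observed */
	unsigned long asleep;	/* submissions that found sqpoll asleep */
};

/* Call right before io_uring_submit() */
static inline void sqpoll_sample(struct io_uring *ring,
				 struct sqpoll_stats *s)
{
	unsigned flags = __atomic_load_n(ring->sq.kflags, __ATOMIC_RELAXED);

	s->submits++;
	if (flags & IORING_SQ_NEED_WAKEUP)
		s->asleep++;	/* sq thread had idled out before we submitted */
}

A high asleep/submits ratio would suggest the configured sq_thread_idle is
too short for the workload (or that SQPOLL isn't buying anything), which is
the kind of signal that could feed the binary search over the idle value
mentioned above.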