On 12/30/23 4:17 PM, Pavel Begunkov wrote:
> On 12/30/23 22:17, Jens Axboe wrote:
>> On 12/30/23 2:06 PM, Pavel Begunkov wrote:
>>> On 12/30/23 17:41, Jens Axboe wrote:
>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>> On 12/26/23 16:32, Jens Axboe wrote:
>>>>>>
>>>>>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>>>>>> Count the running time and actual IO processing time of the sqpoll
>>>>>>> thread, and output the statistical data to fdinfo.
>>>>>>>
>>>>>>> Variable description:
>>>>>>> "work_time" in the code represents the sum of the jiffies of the sq
>>>>>>> thread actually processing IO, that is, how many milliseconds it
>>>>>>> actually takes to process IO. "total_time" represents the total time
>>>>>>> that the sq thread has elapsed from the beginning of the loop to the
>>>>>>> current time point, that is, how many milliseconds it has spent in
>>>>>>> total.
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>> Applied, thanks!
>>>>>>
>>>>>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>>>>>       commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
>>>>>
>>>>> I don't believe the patch is near complete, there are still
>>>>> pending question that the author ignored (see replies to
>>>>> prev revisions).
>>>>
>>>> We can drop and defer, that's not an issue. It's still sitting top of
>>>> branch.
>>>>
>>>> Can you elaborate on the pending questions?
>>>
>>> I guess that wasn't clear, but I duplicated all of them in the
>>> email you're replying to for convenience
>>>
>>>>> Why it uses jiffies instead of some task run time?
>>>>> Consequently, why it's fine to account irq time and other
>>>>> preemption? (hint, it's not)
>>>>
>>>> Yeah that's a good point, might be better to use task run time. Jiffies
>>>> is also an annoying metric to expose, as you'd need to then get the tick
>>>> rate as well. Though I suspect the ratio is the interesting bit here.
>>>
>>> I agree that seconds are nicer, but that's not my point. That's
>>> not about jiffies, but that the patch keeps counting regardless
>>> whether the SQ task was actually running, or the CPU was serving
>>> irq, or even if it was force descheduled.
>>
>> Right, guess I wasn't clear, I did very much agree with using task run
>> time to avoid cases like that where it's perceived running, but really
>> isn't. For example.
>>
>>> I even outlined what a solution may look like, i.e. replace jiffies
>>> with task runtime, which should already be counted in the task.
>>
>> Would be a good change to make. And to be fair, I guess they originally
>> wanted something like that, as the very first patch had some scheduler
>> interactions. Just wasn't done quite right.
>
> Right, just like what the v1 was doing but without touching
> core/sched.

Yep

>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>> can't it be estimated by checking and tracking
>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>
>>>> Asking people to integrate bpf for this is a bit silly imho. Tracking
>>>
>>> I haven't seen any mention of the real use case, did I miss it?
>>> Because otherwise I fail to see how it can possibly be called
>>> silly when it's not clear how exactly it's used.
>>>
>>> Maybe it's a bash program printing stats to a curious user? Or
>>> maybe it's to track once at start, and then nobody cares about
>>> it, in which case NEED_WAKEUP would be justified.
>>>
>>> I can guess it's for adjusting the sq timeouts, but who knows.
>>
>> I only know what is in those threads, but the most obvious use case
>> would indeed be to vet the efficiency of the chosen timeout value and
>> balance cpu usage with latency like that.
>>
>>>> NEED_WAKEUP is also quite cumbersome and would most likely be higher
>>>> overhead as well.
>>>
>>> Comparing to reading a procfs file or doing an io_uring
>>> register syscall? I doubt that. It's also not everyone
>>> would be using that.
>>
>> What's the proposed integration to make NEED_WAKEUP sampling work? As
>> far as I can tell, you'd need to either do that kind of accounting every
>> time you do io_uring_submit(), or make it conditional which would then
>> at least still have a branch.
>>
>> The kernel side would obviously not be free either, but at least it
>> would be restricted to the SQPOLL side of things and not need to get
>> entangled with the general IO path that doesn't use SQPOLL.
>>
>> If we put it in there and have some way to enable/query/disable, then it
>> least it would just be a branch or two in there rather than in the
>> generic path.
>
> It can be in the app without ever touching liburing/kernel. E.g. you
> can binary search the idle time, minimising it, while looking that
> sqpoll doesn't go to sleep too often topping with other conditions.
> Another stat you can get right away is to compare the current idle
> time with the total time since last wake up (which should be idle +
> extra).

Sure, it's more a question of whether you want a simple-to-query metric,
or build something a lot more involved to do it. Guess it depends on the
use case. I think exposing a metric makes sense, as it'll provide an easy
way for an app to gauge its efficiency and whether it'd make sense to
tweak settings. But I guess then what comes next is a way to modify the
timeout setting at runtime...

>>>>> What's the use case in particular? Considering that
>>>>> one of the previous revisions was uapi-less, something
>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>> but a few would want to parse to use the feature.
>>>>
>>>> I brought this up earlier too, fdinfo is not a great API. For anything,
>>>> really.
>>>
>>> I saw that comment, that's why I mentioned, but the
>>> point is that I have doubts the author is even using
>>> the uapi.
>>
>> Not sure I follow... If they aren't using the API, what's the point of
>> the patch? Or are you questioning whether this is being done for an
>> actual use case, or just as a "why not, might be handy" kind of thing?
>
> I assume there is a use case, which hasn't been spelled out
> though AFAIK, but as a matter of fact earlier patches had no uapi.
>
> https://lore.kernel.org/all/20231106074055.1248629-1-xiaobing.li@xxxxxxxxxxx/

Might be because they already have some patches exposing these metrics.
But ->

> and later patches use a suspiciously inconvenient interface,
> i.e. /proc. It's a good question how it was supposed to be used
> (and tested), but I have only guesses. Private patches? A custom
> module? Or maybe a genuine mistake, which is why I'm still very
> curious hearing about the application.

Agree, Xiaobing, can you expand on the actual use case here? How are you
intending to use this metric?

-- 
Jens Axboe
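
As a rough illustration of the "simple to query metric" side of this
discussion, a userspace consumer of the fdinfo stats could look something
like the sketch below. The field names ("SqWorkTime", "SqTotalTime") and
their units are assumptions, not something confirmed in this thread; check
the actual fdinfo output of a kernel carrying the patch before relying on
them.

/*
 * Hypothetical consumer of the proposed fdinfo stats: compute the
 * SQPOLL utilization ratio for a ring fd.  Field names and units are
 * assumed for illustration only.
 */
#include <stdio.h>

static int sq_utilization(int ring_fd, double *util)
{
	char path[64], line[256];
	unsigned long long work = 0, total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", ring_fd);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Assumed field names, see note above */
		sscanf(line, "SqWorkTime: %llu", &work);
		sscanf(line, "SqTotalTime: %llu", &total);
	}
	fclose(f);

	if (!total)
		return -1;
	*util = (double)work / (double)total;
	return 0;
}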
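
And for the userspace-only alternative Pavel outlines, a minimal sampling
sketch might look like the following; it assumes a liburing-managed SQPOLL
ring, reads the shared SQ ring flags word directly, and the counters plus
the idea of sampling at submit time are made up for illustration.

#include <liburing.h>

struct sqpoll_stats {
	unsigned long submits;	/* submissions observed */
	unsigned long asleep;	/* submissions that found sqpoll asleep */
};

/* Call right before io_uring_submit() */
static inline void sqpoll_sample(struct io_uring *ring,
				 struct sqpoll_stats *s)
{
	unsigned flags = __atomic_load_n(ring->sq.kflags, __ATOMIC_RELAXED);

	s->submits++;
	if (flags & IORING_SQ_NEED_WAKEUP)
		s->asleep++;	/* sq thread had idled out before we submitted */
}

A high asleep/submits ratio would suggest the configured sq_thread_idle is
too short for the workload (or that SQPOLL isn't buying anything), which is
the kind of signal that could feed the binary search over the idle value
mentioned above.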