Re: [PATCH v6] io_uring: Statistics of the true utilization of sq threads.

On 1/10/24 09:05, Xiaobing Li wrote:
On 1/5/24 04:02 AM, Pavel Begunkov wrote:
On 1/3/24 05:49, Xiaobing Li wrote:
On 12/30/23 9:27 AM, Pavel Begunkov wrote:
Why does it use jiffies instead of some task run time?
Consequently, why is it fine to account irq time and other
preemption? (hint, it's not)

Why can't it be done with userspace and/or bpf? Why
can't it be estimated by checking and tracking
IORING_SQ_NEED_WAKEUP in userspace?
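
For illustration, a minimal userspace sketch of that tracking,
assuming liburing (the sampling policy around it is left to the
application and is purely illustrative):

    #include <stdbool.h>
    #include <liburing.h>

    /* True while the SQPOLL thread has gone to sleep and needs
     * io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP to resume.
     * Sampling this periodically gives a rough duty-cycle estimate
     * without any new kernel interface.
     */
    static inline bool sqpoll_asleep(struct io_uring *ring)
    {
            return IO_URING_READ_ONCE(*ring->sq.kflags) &
                   IORING_SQ_NEED_WAKEUP;
    }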

What's the use case in particular? Considering that
one of the previous revisions was uapi-less, something
is really fishy here. Again, it's a procfs file nobody
but a few would want to parse to use the feature.

Why does it just keep aggregating stats for the whole
lifetime of the ring? If the workload changes, that
would either totally skew the stats or make them
too inert to be useful. That's especially relevant
for long running (days) processes. There should be a
way to reset it so it starts counting anew.

Hi, Jens and Pavel,
I carefully read the questions you raised.
First of all, as to why I use jiffies to measure time: I
ran some performance tests and found that jiffies cause
a relatively smaller performance loss than task run
time. Of course, using task run time is

What does taking a measure of task runtime look like? I expect it to
be a simple read of a variable inside task_struct, maybe with READ_ONCE,
in which case the overhead shouldn't be realistically measurable. Does
it need locking?

The task runtime I am talking about is similar to this:
start = get_system_time(current);
do_io_part();
sq->total_time += get_system_time(current) - start;

Currently, it is not possible to obtain the execution time of a piece of
code by a simple read of a variable inside task_struct.
Or do you have any good ideas?
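
For reference, the scheduler does maintain such a per-task counter;
a minimal sketch, assuming current->se.sum_exec_runtime (nanoseconds
of on-CPU time) is precise enough for this purpose:

    u64 start, delta;

    /* sum_exec_runtime is advanced by the scheduler on ticks and
     * context switches, so a very short section may read a stale
     * value; over many iterations the error should average out.
     */
    start = READ_ONCE(current->se.sum_exec_runtime);
    do_io_part();
    delta = READ_ONCE(current->se.sum_exec_runtime) - start;
    sq->total_time += delta;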

Jens answered it well

indeed more accurate. But in fact, our accuracy
requirements are not particularly high, so after comprehensive

I'm looking at it as a generic feature for everyone, and the
accuracy behaviour is dependent on circumstances. High load
networking spends quite a good share of CPU in softirq, and
preemption would be dependent on config, scheduling, pinning,
etc.

Yes, I quite agree that the accuracy behaviour depends on circumstances.
In fact, judging from the test results we have gathered, the current
solution basically meets everyone's requirements; the error in the
computed utilization is estimated to be within 0.5%.

Which sounds more than fine, but there are cases where irqs
eat up tens of percent of CPU, which is likely to be more
troublesome.

consideration, we finally chose jiffies.
Of course, if you think the small additional performance
loss has no impact, I can use task run time instead. But
in that case, does the way the sqpoll thread timeout is
calculated also need to change? It, too, is currently
derived from jiffies.

That's a good point. It doesn't have to change unless you're
directly inferring the idle time parameter from those two
time values rather than using the ratio. E.g. a simple
bisection of the idle time based on the utilisation metric
shouldn't change (see the sketch below). But that definitely
raises the question of what the idle_time parameter should
mean exactly, and what is more convenient for algorithms.
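
A minimal sketch of such a bisection; target_util, lo, and hi are
purely illustrative knobs, not part of any uapi:

    /* Halve the search interval for sq_thread_idle based on the
     * observed utilization; this works the same whichever clock
     * produced the ratio.
     */
    if (util_pct < target_util)
            hi = idle;      /* mostly idle spinning: shorten */
    else
            lo = idle;      /* enough useful work: can go longer */
    idle = (lo + hi) / 2;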

We think that idle_time represents the time the sqpoll thread
spends on anything other than submitting IO.

I mean the idle_time parameter, i.e.
struct io_uring_params :: sq_thread_idle, which is how long an SQPOLL
thread must be continuously starved of any work before it goes to sleep.

For example:
sq_thread_idle = 10ms

  -> 9ms starving -> (do work) -> ...
  -> 9ms starving -> (do work) -> ...
  -> 11ms starving -> (more than idle, go sleep)

And the question was whether to count those delays in wall clock
time, as is currently done and which is likely more natural for
userspace, or, theoretically, in task-local time.
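
Roughly, the current wall-clock logic in io_sq_thread() looks like
this (a heavily simplified sketch; do_some_work() stands in for
submitting SQEs, running task_work, polling, etc.):

    timeout = jiffies + sqd->sq_thread_idle;
    for (;;) {
            sqt_spin = do_some_work();
            if (sqt_spin) {
                    /* found work: push the deadline out again */
                    timeout = jiffies + sqd->sq_thread_idle;
            } else if (time_after(jiffies, timeout)) {
                    /* starved for sq_thread_idle jiffies: sleep */
                    break;
            }
            cond_resched();
    }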
In a given cycle, a ring may spend time M submitting IO, or it may
submit no IO at all. We can then optimize the efficiency of the
sqpoll thread in two directions: first, reduce the number of rings
that submit no IO; second, increase the time M, i.e. the proportion
of the cycle spent submitting IO.
To observe what share of CPU time the sq thread actually spends
processing IO, we need this patch.
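
With both counters exposed, the metric itself is just a ratio, e.g.
(work_time and total_time being whatever the patch accumulates, in
jiffies or task runtime; the ratio is clock-agnostic):

    util_pct = work_time * 100 / total_time;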

--
Pavel Begunkov



