Re: [PATCH v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait

David Wei <dw@xxxxxxxxxxx> · Mon, 19 Aug 2024 16:03:48 -0700

On 2024-08-16 18:23, Jeff Moyer wrote:
> Hi, David,
> 
> David Wei <dw@xxxxxxxxxxx> writes:
> 
>> io_uring sets current->in_iowait when waiting for completions, which
>> achieves two things:
>>
>> 1. Proper accounting of the time as iowait time
>> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq
>>
>> For block IO this makes sense as high iowait can be indicative of
>> issues.
> 
> It also let's you know that the system isn't truly idle.  IOW, it would
> be doing some work if it didn't have to wait for I/O.  This was the
> reason the metric was added (admins being confused about why their
> system was showing up idle).

I see. Thanks for the historical context.

> 
>> But for network IO especially recv, the recv side does not control
>> when the completions happen.
>>
>> Some user tooling attributes iowait time as CPU utilisation i.e. not
> 
> What user tooling are you talking about?  If it shows iowait as busy
> time, the tooling is broken.  Please see my last mail on the subject:
>   https://lore.kernel.org/io-uring/x49cz0hxdfa.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Our internal tooling for example considers CPU util% to be (100 -
idle%), but it also has a CPU busy% defined as (100 - idle% - iowait%).
It is very unfortunate that everyone uses CPU util% for monitoring, with
all sorts of alerts, dashboards and load balancers referring to this
value. One reason is that, depending on context, high iowait time may or
may not be a problem, so it isn't as simple as redefining CPU util% to
exclude iowait.

> 
>> idle, so high iowait time looks like high CPU util even though the task
>> is not scheduled and the CPU is free to run other tasks. When doing
>> network IO with e.g. the batch completion feature, the CPU may appear to
>> have high utilisation.
> 
> Again, iowait is idle time.

That's fair. I think it is simpler to have a single "CPU util" metric
defined as (100 - idle%), and have a switch that userspace explicitly
flips to say "I want iowait to be considered truly idle or not". This
means things such as load balancers can be built around a single metric,
rather than having to consider both util/busy and needing to understand
"does iowait mean anything?".

> 
>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>> enter. If set, then current->in_iowait is not set. By default this flag
>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>> This is to prevent waiting for completions being accounted as CPU
>> utilisation.
>>
>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>> above because in_iowait semantics couples 1 and 2 together. Eventually
>> we will untangle the two so the optimisations can be enabled
>> independently of the accounting.
>>
>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>> support. This will be used by liburing to check for this feature.
> 
> If I receive a problem report where iowait time isn't accurate, I now
> have to somehow figure out if an application is setting this flag.  This
> sounds like a support headache, and I do wonder what the benefit is.
> From what you've written, the justification for the patch is that some
> userspace tooling misinterprets iowait.  Shouldn't we just fix that?

Right, I understand your concerns. That's why by default this flag is
not set and io_uring behaves as before with in_iowait always set.

Unfortunately, "just fix userspace" for us is a huge ask because a whole
pyramid of both code and human understanding has been built on the
current definition of "CPU utilisation". This is extremely time
consuming to change, nor is it something that we (io_uring) want to take
on imo. Why not give the option for people to indicate whether they want
iowait showing up or not?

> 
> It may be that certain (all?) network functions, like recv, should not
> be accounted as iowait.  However, I don't think the onus should be on
> applications to tell the kernel about that--the kernel should just
> figure that out on its own.
> 
> Am I alone in these opinions?

Why should the onus be on the kernel? I think it is more difficult for
the kernel to figure out exactly what semantics userspace wants and it
is simpler for userspace to select their preference.