Re: [PATCH v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait

Pavel Begunkov <asml.silence@xxxxxxxxx> · Sun, 18 Aug 2024 03:27:38 +0100

On 8/18/24 02:08, Jens Axboe wrote:
On 8/17/24 4:04 PM, Pavel Begunkov wrote:
On 8/17/24 22:09, Jens Axboe wrote:
On 8/17/24 3:05 PM, Pavel Begunkov wrote:
On 8/17/24 21:20, Jens Axboe wrote:
On 8/17/24 1:44 PM, Pavel Begunkov wrote:
This patchset adds an IORING_ENTER_NO_IOWAIT flag that can be set on
enter. If set, then current->in_iowait is not set. By default this flag
...
And that "use case" for iowait is directly linked to cpufreq, so
if it still counts, then we shouldn't be separating stats from
cpufreq at all.

This is what the cpufreq people want to do anyway, so it'll probably
happen whether we like it or not.

Not against it, quite the opposite

case, yet I think we should cater to it as it very well could be legit,
just in the tiny minority of cases.

I explained why it's a confusing feature. We can make up some niche
case (with enough imagination we can justify basically anything),
but I explained why IMHO an accounting flag (let's forget about
cpufreq) would have a net negative effect. A sysctl knob would be
much more reasonable, but I don't think it's needed at all.

The main thing for me is policy vs flexibility. The fact that boost and
iowait accounting are currently tied together is pretty ugly and will
hopefully go away with my patches.

It's really simple for this stuff - the freq boost is useful (and
needed) for some workloads, and the iowait accounting is never useful
for anything but (currently) comes as an unfortunate side effect of the
former. But even with those two separated, there are still going to be
cases where you want to control when it happens.

You can imagine such cases, but in reality I doubt it. If we
disable the stat part, nobody would notice as nobody cared for
the last 3-4 years before in_iowait was added.

That would be ideal. You're saying Jamal's complaint was purely iowait
based? Because it looked like power concerns to me... If it's just
iowait, then they just need to stop looking at that, that's pretty
simple.

Power consumption, and then, in search of what's wrong, it was
correlated to high iowait as well as difference in C state stats.

But this means that it was indeed power consumption, and iowait was just
the canary in the coal mine that led them down the right path.

And this in turn means that even with the split, we want to
differentiate between short/bursty sleeps and longer ones.

That's what I've been talking about since a couple of months ago,
for networking we have a well measured energy consumption
regression because of iowait, not like we can just leave it as
it is now. And for the lack of a good way to auto tune in the
kernel, an enter flag (described as a performance feature) looks
good, I agree.

...

The name might also be confusing. We need an explanation of when
it could be useful, and name it accordingly. DEEP/SHALLOW_WAIT?
Do you remember how cpufreq accounts for it?

I don't remember how it accounts for it, and was just pondering that
with the above reply. Because if it just decays the sleep state, then
you could just use it generically. If it stays high regardless of how
long you wait, then it could be a power issue. Not on servers really
(well a bit, depending on boosting), but more so on desktop apps.
Laptops tend to be pretty power conservative!

{SHORT,BRIEF}/LONG_WAIT maybe?

I think that's a lot more descriptive. Ideally we'd want to tie this to
wakeup latencies, eg we'd need to know about wakeup latencies. For
example, if the user asks for a 100 usec wait, we'd want to influence
what sleep state is picked by propagating that information. Things like
the min-wait I posted would directly work for that, as it tells the
story in two chapters on what waits we're expecting here. Currently
there's no way to do that (hence iowait -> cpufreq boosting), but there
clearly should be. Or even without min-wait, the timeout is clearly
known here, combined with the expected/desired number of events the
application is looking for.

Yeah, interesting, we can auto apply it depending on the delta
time, etc. Might be worth asking the cpufreq guys about thresholds.
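
As a rough illustration of auto-applying it based on the delta time,
something along these lines could work. This is only a sketch: the
enum, the helper name and the 200 usec threshold are all made up for
illustration, they are not part of the patch or of any existing
kernel or liburing API, and treating a wait without a timeout as
"long" is an assumption as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of a wait; not an existing API. */
enum wait_hint {
	WAIT_HINT_SHORT,	/* brief wait: a freq boost may help wakeup latency */
	WAIT_HINT_LONG,		/* long wait: skip the boost to save power */
};

/* Placeholder threshold; a real value would have to come from measurements. */
#define EXAMPLE_SHORT_WAIT_NS	(200 * 1000ULL)	/* 200 usec, arbitrary */

/*
 * Pick a hint from the timeout the caller passed on enter.
 * Assumption: with no timeout the wait can be arbitrarily long,
 * so it is classified as long.
 */
static enum wait_hint classify_wait(uint64_t timeout_ns, bool has_timeout)
{
	if (!has_timeout)
		return WAIT_HINT_LONG;
	return timeout_ns <= EXAMPLE_SHORT_WAIT_NS ? WAIT_HINT_SHORT
						   : WAIT_HINT_LONG;
}
```

With the placeholder threshold, the 100 usec wait from the example
above would be classified as short, a 10 msec one as long.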

--
Pavel Begunkov