Re: [PATCH] blk-mq: Improvements to the hybrid polling sleep time calculation

"Stephen Bates" <sbates@xxxxxxxxxxxx> · Tue, 29 Aug 2017 15:33:15 +0000

>> From: Stephen Bates <sbates@xxxxxxxxxxxx>
>> 
>> Hybrid polling currently uses half the average completion time as an
>> estimate of how long to poll for. We can improve upon this by noting
>> that polling before the minimum completion time makes no sense. Add a
>> sysfs entry to use this fact to improve CPU utilization in certain
>> cases.
>> 
>> At the same time the minimum is a bit too long to sleep for since we
>> must factor in OS wake time for the thread. For now allow the user to
>> set this via a second sysfs entry (in nanoseconds).
>> 
>> Testing this patch on Intel Optane SSDs showed that using the minimum
>> rather than half reduced CPU utilization from 59% to 38%. Tuning
>> this via the wake time adjustment allowed us to trade CPU load for
>> latency. For example
>> 
>> io_poll  delay  hyb_use_min  adjust  latency  CPU load
>> 1        -1     N/A          N/A     8.4      100%
>> 1        0      0            N/A     8.4      57%
>> 1        0      1            0       10.3     34%
>> 1        9      1            1000    9.9      37%
>> 1        0      1            2000    8.4      47%
>> 1        0      1            10000   8.4      100%
>> 
>> Ideally we will extend this to auto-calculate the wake time rather
>> than have it set by the user.
>
> I don't like this, it's another weird knob that will exist but that
> no one will know how to use. For most of the testing I've done
> recently, hybrid is a win over busy polling - hence I think we should
> make that the default. 60% of mean has also, in testing, been shown
> to be a win. So that's an easy fix/change we can consider.

I do agree that this is a hard knob to tune. I am, however, not happy that the current hybrid default may mean we start polling well before the minimum completion time. That just seems like a waste of CPU resources to me. I do agree that turning on hybrid as the default, and perhaps bumping up the default sleep fraction, is a good idea.
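
To make the difference concrete, here is a rough sketch of the two sleep-time choices being discussed. It is not the patch itself: the function name and parameters are made up for illustration, and it assumes all times are in nanoseconds.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): choose how long to sleep
 * before starting to poll for a completion. All values are in ns.
 *
 *   mean_ns        - tracked mean completion time
 *   min_ns         - tracked minimum completion time
 *   use_min        - 0: sleep for half the mean (current behaviour)
 *                    1: sleep for the minimum, less the wake adjustment
 *   wake_adjust_ns - user-supplied allowance for OS wake-up time
 */
uint64_t hybrid_sleep_ns(uint64_t mean_ns, uint64_t min_ns,
                         int use_min, uint64_t wake_adjust_ns)
{
	if (!use_min)
		return mean_ns / 2;

	/* An adjustment at or above the minimum leaves no time to sleep. */
	if (wake_adjust_ns >= min_ns)
		return 0;

	return min_ns - wake_adjust_ns;
}
```

With a large enough adjustment the sleep time in this sketch collapses to zero, which is plain busy polling. That would be consistent with the 100% CPU load in the adjust=10000 row of the table above.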

> To go beyond that, I'd much rather see us tracking the time waste.
> If we consider the total completion time of an IO to be A+B+C, where:
>
> A	Time needed to go to sleep
> B	Sleep time
> C	Time needed to wake up
>
> then we could feasibly track A+C. We already know how long the IO
> will take to complete, as we track that. At that point we'd have
> a full picture of how long we should sleep.

Yes, this is where I was thinking of taking this functionality in the long term. It seems like tracking C is something other parts of the kernel might need. Does anyone know of any existing code in this space?
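
As a strawman for what tracking A+C could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the struct, the function names, and the 1/8 moving-average weight are invented here and do not correspond to existing kernel code. The idea is to measure how much longer each sleep took than requested, keep a running estimate of that overhead, and subtract it from the expected completion time to get B.

```c
#include <stdint.h>

/* Hypothetical tracker: running estimate of A + C, in nanoseconds. */
struct sleep_tracker {
	uint64_t overhead_ns;
};

/*
 * Fold in one observation: we asked to sleep for requested_ns, but
 * going to sleep and waking up again took elapsed_ns in total.
 * The excess is treated as one sample of A + C.
 */
void tracker_update(struct sleep_tracker *t,
                    uint64_t requested_ns, uint64_t elapsed_ns)
{
	uint64_t sample = elapsed_ns > requested_ns ?
			  elapsed_ns - requested_ns : 0;

	/* Moving average with weight 1/8 (an arbitrary choice). */
	t->overhead_ns = t->overhead_ns - t->overhead_ns / 8 + sample / 8;
}

/*
 * B = expected completion time - (A + C), clamped at zero so that a
 * very fast device falls back to polling immediately.
 */
uint64_t tracker_sleep_ns(const struct sleep_tracker *t,
                          uint64_t expected_ns)
{
	return expected_ns > t->overhead_ns ?
	       expected_ns - t->overhead_ns : 0;
}
```

This only estimates the sum A+C, not the two parts separately, which should be enough for choosing B.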

> Bonus points for informing the lower level scheduler of this as
> well. If the CPU is going idle, we'll enter some sort of power
> state in the processor. If we were able to pass in how long we
> expect to sleep, we could be making better decisions here.

Yup. Again, this seems like something more general than just the block layer. I will do some digging and see if anything is available to leverage here.

Cheers
Stephen