On 11/18/19 9:33 PM, cang@xxxxxxxxxxxxxx wrote:
Functionality aside, we need to treat this change carefully as it may
affect performance.
With the old implementation, when devfreq decides to scale up, we can
make sure that all requests sent to the UFS device after that point go
through the highest UFS gear. Scaling up happens right after the
doorbell is cleared; there is no need to wait until the requests are
freed from the block layer. And 1 second is long enough for the HW to
drain the doorbell.
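For reference, the old wait is ufshcd_wait_for_doorbell_clr() in
drivers/scsi/ufs/ufshcd.c. A condensed sketch of that polling loop
(the real function also drops the host lock and reschedules between
reads; the sleep interval below is illustrative only):

static int ufshcd_wait_for_doorbell_clr(struct ufs_hba *hba,
					u64 wait_timeout_us)
{
	ktime_t start = ktime_get();
	u32 tm_doorbell, tr_doorbell;

	do {
		/* Both doorbells empty means the HW has completed all
		 * outstanding transfer and task management requests. */
		tm_doorbell = ufshcd_readl(hba, REG_UTP_TASK_REQ_DOOR_BELL);
		tr_doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
		if (!tm_doorbell && !tr_doorbell)
			return 0;
		usleep_range(50, 100);	/* illustrative poll interval */
	} while (ktime_to_us(ktime_sub(ktime_get(), start)) < wait_timeout_us);

	return -EBUSY;
}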
In the new implementation of ufshcd_clock_scaling_prepare(), after
blk_freeze_queue_start() we call blk_mq_freeze_queue_wait_timeout()
and wait for up to 1 second. In addition to the requests that have
already been queued to the HW doorbell, more requests can be
dispatched within that second, all through the lowest gear. My first
impression is that this might be a bit sluggish, and I am not sure how
much it would affect benchmarks. Also, if the request queues are
heavily loaded (many bios waiting for a free tag to become a request),
is 1 second long enough to freeze all of these request queues? If not,
performance would suffer if scaling up fails frequently.
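If I am reading the new code correctly, the prepare side is roughly as
sketched below; the per-device loop, the shared deadline and the
rollback on timeout are my paraphrase, not a literal quote of the
patch:

static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
{
	struct scsi_device *sdev;
	unsigned long deadline = jiffies + HZ;	/* shared 1 s budget */
	bool timed_out = false;

	/* Kick off freezing of every SCSI device queue first ... */
	shost_for_each_device(sdev, hba->host)
		blk_freeze_queue_start(sdev->request_queue);

	/* ... then wait, within the shared budget, until all
	 * outstanding requests have been completed and freed. */
	shost_for_each_device(sdev, hba->host) {
		if (!blk_mq_freeze_queue_wait_timeout(sdev->request_queue,
				max_t(long, 0, deadline - jiffies)))
			timed_out = true;
	}

	if (timed_out) {
		/* Roll back so request processing can resume. */
		shost_for_each_device(sdev, hba->host)
			blk_mq_unfreeze_queue(sdev->request_queue);
		return -ETIMEDOUT;
	}
	return 0;
}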
As per my understanding, a request queue only becomes frozen after all
of the existing requests, and the bios that are already waiting for a
free tag on the queue, have been completed and freed by the block
layer. Please correct me if I am wrong.
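If I read block/blk-mq.c correctly, the timed wait boils down to:

int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
				     unsigned long timeout)
{
	/* Returns 0 on timeout; nonzero once every reference to
	 * q_usage_counter, i.e. every outstanding request and every
	 * bio holding a queue reference, has been dropped. */
	return wait_event_timeout(q->mq_freeze_wq,
				  percpu_ref_is_zero(&q->q_usage_counter),
				  timeout);
}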
In the old implementation, by contrast, when devfreq decides to scale
up, scaling succeeds in the vast majority of cases, failing only on
unexpected HW issues.
Hi Can,
I agree with the description above of the current behavior of the UFS
driver, but I do not agree with the interpretation of the behavior of
the changes I proposed. The block layer implements request queue
freezing as follows:
void blk_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (++q->mq_freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		mutex_unlock(&q->mq_freeze_lock);
		if (queue_is_mq(q))
			blk_mq_run_hw_queues(q, false);
	} else {
		mutex_unlock(&q->mq_freeze_lock);
	}
}
As long as calls from other threads to
blk_mq_freeze_queue()/blk_mq_unfreeze_queue() are balanced, after
blk_freeze_queue_start() returns it is guaranteed that
q->q_usage_counter is in __PERCPU_REF_DEAD mode. Calls to
blk_get_request() etc. block in that mode until the request queue is
unfrozen. See also blk_queue_enter().
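For illustration, here is a simplified version of blk_queue_enter()
(the real function in block/blk-core.c additionally handles the
preempt-only/PM special case):

int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	while (true) {
		/* Fast path: succeeds as long as q_usage_counter has
		 * not been switched to __PERCPU_REF_DEAD mode. */
		if (percpu_ref_tryget_live(&q->q_usage_counter))
			return 0;

		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EBUSY;

		/* Sleep until blk_mq_unfreeze_queue() or queue death
		 * wakes up q->mq_freeze_wq. */
		wait_event(q->mq_freeze_wq,
			   !q->mq_freeze_depth || blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}
}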
In other words, calling blk_freeze_queue() from inside
ufshcd_clock_scaling_prepare() should preserve the current behavior,
namely that ufshcd_clock_scaling_prepare() waits for ongoing requests to
finish and also that submission of new requests is postponed until clock
scaling has finished. Do you agree with this analysis?
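In code, I picture the unfreeze side of that pairing roughly as
follows; the function name ufshcd_clock_scaling_unprepare() and the
per-device loop are illustrative assumptions:

static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba)
{
	struct scsi_device *sdev;

	/* Unblock blk_queue_enter() callers; postponed submissions
	 * resume through the new gear. */
	shost_for_each_device(sdev, hba->host)
		blk_mq_unfreeze_queue(sdev->request_queue);
}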
Thanks,
Bart.