Re: [PATCH v6 4/4] ufs: Simplify the clock scaling mechanism implementation


 



On 2019-11-26 12:05 p.m., Bart Van Assche wrote:
On 11/24/19 11:38 PM, cang@xxxxxxxxxxxxxx wrote:
Sorry to bring back the old discussion we had on V5 of this series here.
I spent some time understanding the mq freeze queue implementation but I
still have the same concern regarding your new implementation of clock
scaling prepare/unprepare.

I attached two logs I collected on my setups (the kernel is 5.4).

In log1.txt, I pretty much ported your changes to our code base and copied
a large file to partition /dev/sde1 (mounted to /mnt). You can see from the
debug logs I added that -ETIMEDOUT is returned from ufshcd_devfreq_scale(),
although scaling UP/DOWN succeeds sometimes.

To make it more controllable, I added a new sysfs entry, namely freeze_sde_queue,
so that I can freeze the queue of /dev/sde whenever I do "echo 1 > freeze_sde_queue".
It calls blk_freeze_queue_start(), then blk_mq_freeze_queue_wait_timeout() to wait
for 1 second; in the end, blk_mq_unfreeze_queue() is called regardless of whether
the timeout happened.
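
A minimal sketch of what such a debug hook could look like. The store callback
name, the sde_q pointer and the pr_info message are made up for illustration;
only the three blk-mq calls and the one-second wait come from the description
above.

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/device.h>

/* Assumed to be set up elsewhere to point at the request queue of /dev/sde. */
static struct request_queue *sde_q;

/*
 * Hypothetical sysfs store callback: start freezing the queue, wait up to
 * one second for outstanding requests to drain, then unfreeze no matter
 * what the outcome was.
 */
static ssize_t freeze_sde_queue_store(struct device *dev,
				      struct device_attribute *attr,
				      const char *buf, size_t count)
{
	struct request_queue *q = sde_q;

	blk_freeze_queue_start(q);
	/* Returns 0 on timeout, the remaining jiffies otherwise. */
	if (blk_mq_freeze_queue_wait_timeout(q, HZ) == 0)
		pr_info("%s: freeze wait timed out after 1s\n", __func__);
	blk_mq_unfreeze_queue(q);

	return count;
}
static DEVICE_ATTR_WO(freeze_sde_queue);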

I also added a flag, called enable_print, to struct request_queue, and set it after
I call blk_freeze_queue_start(). The block layer, when this flag is set, prints logs
from blk_mq_start_request() and blk_mq_get_request().
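
Purely for illustration, that instrumentation could look roughly like the
following; the enable_print field does not exist in the upstream struct
request_queue and the print format is made up:

/*
 * Illustrative only: print every request started on a queue whose locally
 * added enable_print flag is set. In 5.4, blk_mq_start_request() lives in
 * block/blk-mq.c; only the first few lines of the function are shown.
 */
void blk_mq_start_request(struct request *rq)
{
	struct request_queue *q = rq->q;

	if (q->enable_print)	/* hypothetical debug flag in struct request_queue */
		pr_info("%s: tag %d op %u\n", __func__, rq->tag, req_op(rq));

	/* ... existing blk_mq_start_request() body follows ... */
}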

I tried this multiple times while copying a large file to partition /dev/sde1 (mounted to /mnt).

In log2.txt, there are still wait timeouts, and you can see that after
blk_freeze_queue_start() is called for the request queue of /dev/sde, requests can
still be sent down to the UFS driver; these requests are the "latency" I mentioned
in our previous discussion. I know these are not "new" requests that reached the
block layer after the freeze started, but ones that were already in the queue before
it started. However, they are what makes the difference from the old implementation:
when scaling up should happen, these requests are still sent through the lowest HS
Gear, while when scaling down should happen, they are sent through the highest HS
Gear. The former may cause a performance issue while the latter may cause a power
penalty; this is our major concern.

Hi Can,

Thank you for having taken the time to run these tests and to share the output
of these tests. If I understood you correctly the patch I posted works but
introduces a performance regression, namely that it takes longer to suspend
request processing for an UFS host. How about replacing patch 4/4 with the
patch below?

Thanks,

Bart.

Subject: [PATCH] ufs: Make clock scaling more robust

Scaling the clock is only safe while no commands are in progress. The
current clock scaling implementation uses hba->clk_scaling_lock to
serialize clock scaling against the following three functions:
* ufshcd_queuecommand()          (uses sdev->request_queue)
* ufshcd_exec_dev_cmd()          (uses hba->cmd_queue)
* ufshcd_issue_devman_upiu_cmd() (uses hba->cmd_queue)

__ufshcd_issue_tm_cmd(), which uses hba->tmf_queue, is not yet serialized
against clock scaling. Disallow TMFs that are submitted from user space via
the following path while clock scaling is in progress:

submit UPIU_TRANSACTION_TASK_REQ on an UFS BSG queue
-> ufs_bsg_request()
   -> ufshcd_exec_raw_upiu_cmd(msgcode = UPIU_TRANSACTION_TASK_REQ)
     -> __ufshcd_issue_tm_cmd()
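
For context, the serialization referred to above is built around the
hba->clk_scaling_lock rw_semaphore. Below is a simplified sketch of the idea,
not the literal ufshcd.c code: submission paths take the lock for reading,
clock scaling takes it for writing, and the TMF path shown above currently
takes neither.

/*
 * Submission side, e.g. ufshcd_queuecommand(): back off and let the SCSI
 * midlayer retry if clock scaling currently holds the lock for writing.
 */
if (!down_read_trylock(&hba->clk_scaling_lock))
	return SCSI_MLQUEUE_HOST_BUSY;
/* ... issue the command ... */
up_read(&hba->clk_scaling_lock);

/*
 * Clock scaling side (prepare/unprepare): block out new submissions, wait
 * for the doorbells to clear, change the gear, then release.
 */
down_write(&hba->clk_scaling_lock);
/* ... ufshcd_wait_for_doorbell_clr() and the actual scaling ... */
up_write(&hba->clk_scaling_lock);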

Cc: Can Guo <cang@xxxxxxxxxxxxxx>
Cc: Bean Huo <beanhuo@xxxxxxxxxx>
Cc: Avri Altman <avri.altman@xxxxxxx>
Cc: Stanley Chu <stanley.chu@xxxxxxxxxxxx>
Cc: Tomas Winkler <tomas.winkler@xxxxxxxxx>
Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
---
  drivers/scsi/ufs/ufshcd.c | 36 +++++++++++++++++++++++++-----------
  1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 99337e0b54f6..631c588f279a 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -972,14 +972,13 @@ static bool ufshcd_is_devfreq_scaling_required(struct ufs_hba *hba,
  }

  static int ufshcd_wait_for_doorbell_clr(struct ufs_hba *hba,
-                    u64 wait_timeout_us)
+                    unsigned long deadline)
  {
      unsigned long flags;
      int ret = 0;
      u32 tm_doorbell;
      u32 tr_doorbell;
      bool timeout = false, do_last_check = false;
-    ktime_t start;

      ufshcd_hold(hba, false);
      spin_lock_irqsave(hba->host->host_lock, flags);
@@ -987,7 +986,6 @@ static int ufshcd_wait_for_doorbell_clr(struct ufs_hba *hba,
       * Wait for all the outstanding tasks/transfer requests.
       * Verify by checking the doorbell registers are clear.
       */
-    start = ktime_get();

Rather than throw away the high precision clock and resort to jiffies, perhaps
you might try the middle road. For example: ktime_get_coarse().
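
Something along these lines inside ufshcd_wait_for_doorbell_clr(), i.e. keep
the ktime_t bookkeeping but sample the coarse, jiffies-resolution clock;
illustrative only, not the posted patch:

/*
 * Sketch: replace ktime_get() with ktime_get_coarse() so the timeout check
 * stays ktime_t based but avoids reading the hardware clocksource on every
 * loop iteration.
 */
ktime_t start = ktime_get_coarse();

/* ... inside the existing polling loop ... */
if (ktime_to_us(ktime_sub(ktime_get_coarse(), start)) > wait_timeout_us) {
	timeout = true;
	/* existing "do one last check" handling follows */
}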

Doug Gilbert


