On 1/25/19 9:33 AM, Ming Lei wrote:
> On Fri, Jan 25, 2019 at 8:45 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>>
>> On 1/22/19 4:22 AM, Ming Lei wrote:
>>> On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> On my laptop I am experiencing occasional hangs of applications during
>>>> fsync(), sometimes up to 30 seconds long. I'm using a BTRFS volume
>>>> which spans two partitions on the same SSD (one of them used to contain
>>>> a Windows installation, but I removed it and added the partition to the
>>>> BTRFS volume instead). Also, the problem only occurs when an I/O
>>>> scheduler (mq-deadline) is in use. I'm running kernel version 4.20.3.
>>>>
>>>> From what I understand so far, what happens is that a sync request
>>>> fails in the SCSI/ATA layer, in ata_std_qc_defer(), because it is a
>>>> "non-NCQ command" and cannot be queued together with other commands.
>>>> This propagates up into blk_mq_dispatch_rq_list(), where the call
>>>>
>>>>     ret = q->mq_ops->queue_rq(hctx, &bd);
>>>>
>>>> returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list() there
>>>> is this piece of code:
>>>>
>>>>     needs_restart = blk_mq_sched_needs_restart(hctx);
>>>>     if (!needs_restart ||
>>>>         (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>>         blk_mq_run_hw_queue(hctx, true);
>>>>     else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>>>         blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>>
>>>> It restarts the queue after a delay if BLK_STS_RESOURCE was returned,
>>>> but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
>>>> fsync() seems to hang until some other process wants to do I/O.
>>>>
>>>> So if I do
>>>>
>>>> -	else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>>> +	else if (needs_restart && (ret == BLK_STS_RESOURCE ||
>>>> +				   ret == BLK_STS_DEV_RESOURCE))
>>>>
>>>> it fixes my problem.
>>>> But was there a reason why BLK_STS_DEV_RESOURCE was
>>>> treated differently than BLK_STS_RESOURCE here?
>>>
>>> Please see the comment:
>>>
>>> /*
>>>  * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
>>>  * device related resources are unavailable, but the driver can guarantee
>>>  * that the queue will be rerun in the future once resources become
>>>  * available again. This is typically the case for device specific
>>>  * resources that are consumed for IO. If the driver fails allocating these
>>>  * resources, we know that inflight (or pending) IO will free these
>>>  * resources upon completion.
>>>  *
>>>  * This is different from BLK_STS_RESOURCE in that it explicitly references
>>>  * a device specific resource. For resources of wider scope, allocation
>>>  * failure can happen without having pending IO. This means that we can't
>>>  * rely on request completions freeing these resources, as IO may not be in
>>>  * flight. Examples of that are kernel memory allocations, DMA mappings, or
>>>  * any other system wide resources.
>>>  */
>>> #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
>>>
>>>> In any case, it seems wrong to me that ret is used here at all, as it
>>>> just contains the return value of the last request in the list, and
>>>> whether we rerun the queue should probably not only depend on the last
>>>> request?
>>>>
>>>> Can any of the experts tell me whether this makes sense, or did I get
>>>> something completely wrong?
>>>
>>> Sounds like a bug in the SCSI or ata driver.
>>>
>>> I remember there is a hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
>>> but I never got lucky enough to reproduce it.
>>>
>>> scsi_queue_rq():
>>>     ......
>>>     case BLK_STS_RESOURCE:
>>>         if (atomic_read(&sdev->device_busy) ||
>>>             scsi_device_blocked(sdev))
>>>             ret = BLK_STS_DEV_RESOURCE;
>>>
>> OK, please tell me if I understand this right now.
>> What is supposed to happen is:
>>
>> - request 1 is being processed by the device
>> - request 2 (sync) fails with BLK_STS_DEV_RESOURCE because request 1 is
>>   still being processed
>> - request 2 is put into hctx->dispatch inside blk_mq_dispatch_rq_list (*)
>> - the queue does not get rerun because of BLK_STS_DEV_RESOURCE
>> - request 1 finishes
>> - scsi_end_request calls blk_mq_run_hw_queues
>> - blk_mq_run_hw_queue finds request 2 in hctx->dispatch (**)
>> - runs request 2
>
> Right, it is one typical case if the hw queue depth is 2.
>
>> However, what happens for me is that request 1 finishes earlier, so that
>> (**) is executed before (*) and finds an empty hctx->dispatch list.
>>
>> At least this is consistent with what I see.
>>
>>> All in-flight requests may complete between reading 'sdev->device_busy'
>>> and setting ret to 'BLK_STS_DEV_RESOURCE', and then this IO hang may
>>> be triggered.
>>>
>>
>> More accurate would be: between reading sdev->device_busy and doing
>> list_splice_init(list, &hctx->dispatch) inside blk_mq_dispatch_rq_list.
>
> Exactly. It sounds like it is not so easy to trigger.

blk_mq_dispatch_rq_list
  scsi_queue_rq
    if (atomic_read(&sdev->device_busy) ||
        scsi_device_blocked(sdev))
      ret = BLK_STS_DEV_RESOURCE;
                                        scsi_end_request
                                          __blk_mq_end_request
                                            blk_mq_sched_restart // clear RESTART
                                              blk_mq_run_hw_queue
                                        blk_mq_run_hw_queues
  list_splice_init(list, &hctx->dispatch)
  needs_restart = blk_mq_sched_needs_restart(hctx)

The 'needs_restart' will be false, so the queue would be rerun.

Thanks
Jianchao

>> I'm not sure how the SCSI driver can do anything about this except
>> returning BLK_STS_RESOURCE instead (which I guess would not be a great
>> solution), as basically everything else happens inside blk-mq. Or what
>> do you have in mind?
>
> No better idea yet, :-(
>
> thanks,
> Ming Lei