Re: fsync hangs after scsi rejected a request

Florian Stecker <m19@xxxxxxxxxxxxxxxxx> · Fri, 25 Jan 2019 01:45:54 +0100

On 1/22/19 4:22 AM, Ming Lei wrote:
On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:

Hi everyone,

on my laptop, I am experiencing occasional hangs of applications during
fsync(), which are sometimes up to 30 seconds long. I'm using a BTRFS
which spans two partitions on the same SSD (one of them used to contain
a Windows, but I removed it and added the partition to the BTRFS volume
instead). Also, the problem only occurs when an I/O scheduler
(mq-deadline) is in use. I'm running kernel version 4.20.3.

  From what I understand so far, what happens is that a sync request
fails in the SCSI/ATA layer, in ata_std_qc_defer(), because it is a
"Non-NCQ command" and can not be queued together with other commands.
This propagates up into blk_mq_dispatch_rq_list(), where the call

ret = q->mq_ops->queue_rq(hctx, &bd);

returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list(), there
is the piece of code

needs_restart = blk_mq_sched_needs_restart(hctx);
if (!needs_restart ||
         (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
         blk_mq_run_hw_queue(hctx, true);
else if (needs_restart && (ret == BLK_STS_RESOURCE))
         blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

which restarts the queue after a delay if BLK_STS_RESOURCE was returned,
but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
fsync() seems to hang until some other process wants to do I/O.

So if I do

- else if (needs_restart && (ret == BLK_STS_RESOURCE))
+ else if (needs_restart && (ret == BLK_STS_RESOURCE || ret ==
BLK_STS_DEV_RESOURCE))

it fixes my problem. But was there a reason why BLK_STS_DEV_RESOURCE was
treated differently that BLK_STS_RESOURCE here?

Please see the comment:

/*
  * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
  * device related resources are unavailable, but the driver can guarantee
  * that the queue will be rerun in the future once resources become
  * available again. This is typically the case for device specific
  * resources that are consumed for IO. If the driver fails allocating these
  * resources, we know that inflight (or pending) IO will free these
  * resource upon completion.
  *
  * This is different from BLK_STS_RESOURCE in that it explicitly references
  * a device specific resource. For resources of wider scope, allocation
  * failure can happen without having pending IO. This means that we can't
  * rely on request completions freeing these resources, as IO may not be in
  * flight. Examples of that are kernel memory allocations, DMA mappings, or
  * any other system wide resources.
  */
#define BLK_STS_DEV_RESOURCE    ((__force blk_status_t)13)

In any case, it seems wrong to me that ret is used here at all, as it
just contains the return value of the last request in the list, and
whether we rerun the queue should probably not only depend on the last
request?

Can anyone of the experts tell me whether this makes sense or I got
something completely wrong?

Sounds a bug in SCSI or ata driver.

I remember there is hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
but I never get lucky to reproduce it.

scsi_queue_rq():
         ......
         case BLK_STS_RESOURCE:
                 if (atomic_read(&sdev->device_busy) ||
                     scsi_device_blocked(sdev))
                         ret = BLK_STS_DEV_RESOURCE;

OK, please tell me if I understand this right now. What is supposed to 
happen is

- request 1 is being processed by the device
- request 2 (sync) fails with STS_DEV_RESOURCE because request 1 is 
still being processed
- request 2 is put into hctx->dispatch inside blk_mq_dispatch_rq_list *
- queue does not get rerun because of STS_DEV_RESOURCE
- request 1 finishes
- scsi_end_request calls blk_mq_run_hw_queues
- blk_mq_run_hw_queue finds request 2 in hctx->dispatch **
- runs request 2

However, what happens for me is that request 1 finishes earlier, so that 
** is executed before *, and finds an empty hctx->dispatch list.

At least this is consistent with what I see.

> All in-flight request may complete between reading 'sdev->device_busy'
> and setting ret as 'BLK_STS_DEV_RESOURCE', then this IO hang may
> be triggered.
>

More accurate would be: between reading sdev->device_busy and doing 
list_splice_init(list,&hctx->dispatch) inside blk_mq_dispatch_rq_list.

I'm not sure how the SCSI driver can do anything about this except 
returning BLK_STS_RESOURCE instead (which I guess would not be a great 
solution), as basically everything else happens inside blk-mq. Or what 
do you have in mind?