Re: fsync hangs after scsi rejected a request

Ming Lei <tom.leiming@xxxxxxxxx> · Tue, 22 Jan 2019 11:22:02 +0800

On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi everyone,
>
> on my laptop, I am experiencing occasional hangs of applications during
> fsync(), which are sometimes up to 30 seconds long. I'm using a BTRFS
> which spans two partitions on the same SSD (one of them used to contain
> a Windows, but I removed it and added the partition to the BTRFS volume
> instead). Also, the problem only occurs when an I/O scheduler
> (mq-deadline) is in use. I'm running kernel version 4.20.3.
>
>  From what I understand so far, what happens is that a sync request
> fails in the SCSI/ATA layer, in ata_std_qc_defer(), because it is a
> "Non-NCQ command" and can not be queued together with other commands.
> This propagates up into blk_mq_dispatch_rq_list(), where the call
>
> ret = q->mq_ops->queue_rq(hctx, &bd);
>
> returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list(), there
> is the piece of code
>
> needs_restart = blk_mq_sched_needs_restart(hctx);
> if (!needs_restart ||
>         (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>         blk_mq_run_hw_queue(hctx, true);
> else if (needs_restart && (ret == BLK_STS_RESOURCE))
>         blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>
> which restarts the queue after a delay if BLK_STS_RESOURCE was returned,
> but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
> fsync() seems to hang until some other process wants to do I/O.
>
> So if I do
>
> - else if (needs_restart && (ret == BLK_STS_RESOURCE))
> + else if (needs_restart && (ret == BLK_STS_RESOURCE || ret ==
> BLK_STS_DEV_RESOURCE))
>
> it fixes my problem. But was there a reason why BLK_STS_DEV_RESOURCE was
> treated differently that BLK_STS_RESOURCE here?

Please see the comment:

/*
 * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
 * device related resources are unavailable, but the driver can guarantee
 * that the queue will be rerun in the future once resources become
 * available again. This is typically the case for device specific
 * resources that are consumed for IO. If the driver fails allocating these
 * resources, we know that inflight (or pending) IO will free these
 * resource upon completion.
 *
 * This is different from BLK_STS_RESOURCE in that it explicitly references
 * a device specific resource. For resources of wider scope, allocation
 * failure can happen without having pending IO. This means that we can't
 * rely on request completions freeing these resources, as IO may not be in
 * flight. Examples of that are kernel memory allocations, DMA mappings, or
 * any other system wide resources.
 */
#define BLK_STS_DEV_RESOURCE    ((__force blk_status_t)13)

>
> In any case, it seems wrong to me that ret is used here at all, as it
> just contains the return value of the last request in the list, and
> whether we rerun the queue should probably not only depend on the last
> request?
>
> Can anyone of the experts tell me whether this makes sense or I got
> something completely wrong?

Sounds a bug in SCSI or ata driver.

I remember there is hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
but I never get lucky to reproduce it.

scsi_queue_rq():
        ......
        case BLK_STS_RESOURCE:
                if (atomic_read(&sdev->device_busy) ||
                    scsi_device_blocked(sdev))
                        ret = BLK_STS_DEV_RESOURCE;

All in-flight request may complete between reading 'sdev->device_busy'
and setting ret as 'BLK_STS_DEV_RESOURCE', then this IO hang may
be triggered.

Thanks,
Ming Lei