On Fri, Jan 25, 2019 at 8:45 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>
> On 1/22/19 4:22 AM, Ming Lei wrote:
> > On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi everyone,
> >>
> >> on my laptop I am experiencing occasional hangs of applications during
> >> fsync(), sometimes up to 30 seconds long. I'm using a BTRFS filesystem
> >> which spans two partitions on the same SSD (one of them used to contain
> >> a Windows installation, but I removed it and added the partition to the
> >> BTRFS volume instead). Also, the problem only occurs when an I/O
> >> scheduler (mq-deadline) is in use. I'm running kernel version 4.20.3.
> >>
> >> From what I understand so far, a sync request fails in the SCSI/ATA
> >> layer, in ata_std_qc_defer(), because it is a "Non-NCQ command" and
> >> cannot be queued together with other commands. This propagates up into
> >> blk_mq_dispatch_rq_list(), where the call
> >>
> >>         ret = q->mq_ops->queue_rq(hctx, &bd);
> >>
> >> returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list() there
> >> is this piece of code:
> >>
> >>         needs_restart = blk_mq_sched_needs_restart(hctx);
> >>         if (!needs_restart ||
> >>             (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >>                 blk_mq_run_hw_queue(hctx, true);
> >>         else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >>                 blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> >>
> >> which restarts the queue after a delay if BLK_STS_RESOURCE was returned,
> >> but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
> >> fsync() seems to hang until some other process wants to do I/O.
> >>
> >> So if I do
> >>
> >> -       else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >> +       else if (needs_restart && (ret == BLK_STS_RESOURCE ||
> >> +                                  ret == BLK_STS_DEV_RESOURCE))
> >>
> >> it fixes my problem. But was there a reason why BLK_STS_DEV_RESOURCE is
> >> treated differently from BLK_STS_RESOURCE here?
> >
> > Please see the comment:
> >
> > /*
> >  * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
> >  * device related resources are unavailable, but the driver can guarantee
> >  * that the queue will be rerun in the future once resources become
> >  * available again. This is typically the case for device specific
> >  * resources that are consumed for IO. If the driver fails allocating these
> >  * resources, we know that inflight (or pending) IO will free these
> >  * resources upon completion.
> >  *
> >  * This is different from BLK_STS_RESOURCE in that it explicitly references
> >  * a device specific resource. For resources of wider scope, allocation
> >  * failure can happen without having pending IO. This means that we can't
> >  * rely on request completions freeing these resources, as IO may not be in
> >  * flight. Examples of that are kernel memory allocations, DMA mappings, or
> >  * any other system wide resources.
> >  */
> > #define BLK_STS_DEV_RESOURCE    ((__force blk_status_t)13)
> >
> >> In any case, it seems wrong to me that ret is used here at all, as it
> >> just contains the return value of the last request in the list, and
> >> whether we rerun the queue should probably not depend only on the last
> >> request?
> >>
> >> Can any of the experts tell me whether this makes sense, or whether I
> >> got something completely wrong?
> >
> > Sounds like a bug in the SCSI or ATA driver.
> >
> > I remember there is a hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
> > but I never got lucky enough to reproduce it.
> >
> > scsi_queue_rq():
> >         ......
> >         case BLK_STS_RESOURCE:
> >                 if (atomic_read(&sdev->device_busy) ||
> >                     scsi_device_blocked(sdev))
> >                         ret = BLK_STS_DEV_RESOURCE;
> >
>
> OK, please tell me if I understand this right now. What is supposed to
> happen is:
>
> - request 1 is being processed by the device
> - request 2 (sync) fails with BLK_STS_DEV_RESOURCE because request 1 is
>   still being processed
> - request 2 is put into hctx->dispatch inside blk_mq_dispatch_rq_list (*)
> - the queue does not get rerun because of BLK_STS_DEV_RESOURCE
> - request 1 finishes
> - scsi_end_request calls blk_mq_run_hw_queues
> - blk_mq_run_hw_queue finds request 2 in hctx->dispatch (**)
> - request 2 is run

Right, that is a typical case if the hw queue depth is 2.

> However, what happens for me is that request 1 finishes earlier, so that
> (**) is executed before (*) and finds an empty hctx->dispatch list.
>
> At least this is consistent with what I see.
>
> > All in-flight requests may complete between reading 'sdev->device_busy'
> > and setting ret to 'BLK_STS_DEV_RESOURCE', and then this IO hang can be
> > triggered.
>
> More accurately: between reading sdev->device_busy and doing
> list_splice_init(list, &hctx->dispatch) inside blk_mq_dispatch_rq_list.

Exactly.
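The bad ordering then looks roughly like this (just a sketch of the
interleaving described above, against the 4.20 code paths):

  CPU0: blk_mq_dispatch_rq_list()           CPU1: request 1 completes
  -------------------------------           --------------------------
  scsi_queue_rq(request 2)
    hits the BLK_STS_RESOURCE case above,
    sees sdev->device_busy != 0,
    returns BLK_STS_DEV_RESOURCE
                                            scsi_end_request()
                                              blk_mq_run_hw_queues()
                                                hctx->dispatch still empty,
                                                nothing to run
  list_splice_init(list, &hctx->dispatch)
  ret == BLK_STS_DEV_RESOURCE, so no
  blk_mq_delay_run_hw_queue() either

After that nothing reruns the queue, and request 2 sits in hctx->dispatch
until some other I/O comes in, which matches the hang you are seeing.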
>
> I'm not sure how the SCSI driver can do anything about this except
> returning BLK_STS_RESOURCE instead (which I guess would not be a great
> solution), as basically everything else happens inside blk-mq. Or what
> do you have in mind?

I don't have a better idea yet, :-( A sketch of that option is below, just
for reference.

Thanks,
Ming Lei
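To make the "return BLK_STS_RESOURCE instead" option concrete: against the
scsi_queue_rq() fragment quoted above it would roughly mean dropping the
downgrade, i.e. something like

        case BLK_STS_RESOURCE:
-               if (atomic_read(&sdev->device_busy) ||
-                   scsi_device_blocked(sdev))
-                       ret = BLK_STS_DEV_RESOURCE;

so the request would always take the BLK_STS_RESOURCE path in
blk_mq_dispatch_rq_list() and rely on blk_mq_delay_run_hw_queue() rather
than on a completion-driven rerun, which is presumably why it would not be
a great solution.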