On Fri, Jan 25, 2019 at 8:45 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>
> On 1/22/19 4:22 AM, Ming Lei wrote:
> > On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi everyone,
> >>
> >> on my laptop I am experiencing occasional hangs of applications during
> >> fsync(), sometimes up to 30 seconds long. I'm using a BTRFS filesystem
> >> which spans two partitions on the same SSD (one of them used to contain
> >> a Windows installation, but I removed it and added the partition to the
> >> BTRFS volume instead). Also, the problem only occurs when an I/O
> >> scheduler (mq-deadline) is in use. I'm running kernel version 4.20.3.
> >>
> >> From what I understand so far, a sync request fails in the SCSI/ATA
> >> layer, in ata_std_qc_defer(), because it is a "Non-NCQ command" and
> >> cannot be queued together with other commands. This propagates up into
> >> blk_mq_dispatch_rq_list(), where the call
> >>
> >>         ret = q->mq_ops->queue_rq(hctx, &bd);
> >>
> >> returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list() there
> >> is this piece of code:
> >>
> >>         needs_restart = blk_mq_sched_needs_restart(hctx);
> >>         if (!needs_restart ||
> >>             (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >>                 blk_mq_run_hw_queue(hctx, true);
> >>         else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >>                 blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> >>
> >> which restarts the queue after a delay if BLK_STS_RESOURCE was returned,
> >> but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
> >> fsync() seems to hang until some other process wants to do I/O.
> >>
> >> So if I do
> >>
> >> -       else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >> +       else if (needs_restart && (ret == BLK_STS_RESOURCE ||
> >> +                                  ret == BLK_STS_DEV_RESOURCE))
> >>
> >> it fixes my problem. But was there a reason why BLK_STS_DEV_RESOURCE is
> >> treated differently from BLK_STS_RESOURCE here?
> >
> > Please see the comment:
> >
> > /*
> >  * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
> >  * device related resources are unavailable, but the driver can guarantee
> >  * that the queue will be rerun in the future once resources become
> >  * available again. This is typically the case for device specific
> >  * resources that are consumed for IO. If the driver fails allocating these
> >  * resources, we know that inflight (or pending) IO will free these
> >  * resources upon completion.
> >  *
> >  * This is different from BLK_STS_RESOURCE in that it explicitly references
> >  * a device specific resource. For resources of wider scope, allocation
> >  * failure can happen without having pending IO. This means that we can't
> >  * rely on request completions freeing these resources, as IO may not be in
> >  * flight. Examples of that are kernel memory allocations, DMA mappings, or
> >  * any other system wide resources.
> >  */
> > #define BLK_STS_DEV_RESOURCE    ((__force blk_status_t)13)
> >
> >> In any case, it seems wrong to me that ret is used here at all, as it
> >> just contains the return value of the last request in the list, and
> >> whether we rerun the queue should probably not depend only on the last
> >> request?
> >>
> >> Can any of the experts tell me whether this makes sense, or whether I
> >> got something completely wrong?
> >
> > Sounds like a bug in the SCSI or ATA driver.
> >
> > I remember there is a hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
> > but I never got lucky enough to reproduce it.
> >
> > scsi_queue_rq():
> >         ......
> >         case BLK_STS_RESOURCE:
> >                 if (atomic_read(&sdev->device_busy) ||
> >                     scsi_device_blocked(sdev))
> >                         ret = BLK_STS_DEV_RESOURCE;
> >
>
> OK, please tell me if I understand this right now. What is supposed to
> happen is:
>
> - request 1 is being processed by the device
> - request 2 (sync) fails with BLK_STS_DEV_RESOURCE because request 1 is
>   still being processed
> - request 2 is put into hctx->dispatch inside blk_mq_dispatch_rq_list (*)
> - the queue does not get rerun because of BLK_STS_DEV_RESOURCE
> - request 1 finishes
> - scsi_end_request calls blk_mq_run_hw_queues
> - blk_mq_run_hw_queue finds request 2 in hctx->dispatch (**)
> - request 2 is run

Right, that is a typical case if the hw queue depth is 2.

> However, what happens for me is that request 1 finishes earlier, so that
> (**) is executed before (*) and finds an empty hctx->dispatch list.
>
> At least this is consistent with what I see.
>
> > All in-flight requests may complete between reading 'sdev->device_busy'
> > and setting ret to 'BLK_STS_DEV_RESOURCE', and then this IO hang can be
> > triggered.
>
> More accurately: between reading sdev->device_busy and doing
> list_splice_init(list, &hctx->dispatch) inside blk_mq_dispatch_rq_list.

Exactly.
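The bad ordering then looks roughly like this (just a sketch of the
interleaving described above, against the 4.20 code paths):

  CPU0: blk_mq_dispatch_rq_list()           CPU1: request 1 completes
  -------------------------------           --------------------------
  scsi_queue_rq(request 2)
    hits the BLK_STS_RESOURCE case above,
    sees sdev->device_busy != 0,
    returns BLK_STS_DEV_RESOURCE
                                            scsi_end_request()
                                              blk_mq_run_hw_queues()
                                                hctx->dispatch still empty,
                                                nothing to run
  list_splice_init(list, &hctx->dispatch)
  ret == BLK_STS_DEV_RESOURCE, so no
  blk_mq_delay_run_hw_queue() either

After that nothing reruns the queue, and request 2 sits in hctx->dispatch
until some other I/O comes in, which matches the hang you are seeing.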
>
> I'm not sure how the SCSI driver can do anything about this except
> returning BLK_STS_RESOURCE instead (which I guess would not be a great
> solution), as basically everything else happens inside blk-mq. Or what
> do you have in mind?

I don't have a better idea yet, :-( A sketch of that option is below, just
for reference.

Thanks,
Ming Lei
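To make the "return BLK_STS_RESOURCE instead" option concrete: against the
scsi_queue_rq() fragment quoted above it would roughly mean dropping the
downgrade, i.e. something like

        case BLK_STS_RESOURCE:
-               if (atomic_read(&sdev->device_busy) ||
-                   scsi_device_blocked(sdev))
-                       ret = BLK_STS_DEV_RESOURCE;

so the request would always take the BLK_STS_RESOURCE path in
blk_mq_dispatch_rq_list() and rely on blk_mq_delay_run_hw_queue() rather
than on a completion-driven rerun, which is presumably why it would not be
a great solution.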