On 1/25/19 9:33 AM, Ming Lei wrote:
> On Fri, Jan 25, 2019 at 8:45 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>>
>> On 1/22/19 4:22 AM, Ming Lei wrote:
>>> On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m19@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> On my laptop I am experiencing occasional hangs of applications during
>>>> fsync(), sometimes up to 30 seconds long. I'm using a BTRFS volume
>>>> which spans two partitions on the same SSD (one of them used to contain
>>>> a Windows installation, but I removed it and added the partition to the
>>>> BTRFS volume instead). Also, the problem only occurs when an I/O
>>>> scheduler (mq-deadline) is in use. I'm running kernel version 4.20.3.
>>>>
>>>> From what I understand so far, what happens is that a sync request
>>>> fails in the SCSI/ATA layer, in ata_std_qc_defer(), because it is a
>>>> "non-NCQ command" and cannot be queued together with other commands.
>>>> This propagates up into blk_mq_dispatch_rq_list(), where the call
>>>>
>>>>     ret = q->mq_ops->queue_rq(hctx, &bd);
>>>>
>>>> returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list() there
>>>> is this piece of code:
>>>>
>>>>     needs_restart = blk_mq_sched_needs_restart(hctx);
>>>>     if (!needs_restart ||
>>>>         (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>>         blk_mq_run_hw_queue(hctx, true);
>>>>     else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>>>         blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>>
>>>> It restarts the queue after a delay if BLK_STS_RESOURCE was returned,
>>>> but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
>>>> fsync() seems to hang until some other process wants to do I/O.
>>>>
>>>> So if I do
>>>>
>>>> -	else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>>> +	else if (needs_restart && (ret == BLK_STS_RESOURCE ||
>>>> +				   ret == BLK_STS_DEV_RESOURCE))
>>>>
>>>> it fixes my problem.
>>>> But was there a reason why BLK_STS_DEV_RESOURCE was
>>>> treated differently than BLK_STS_RESOURCE here?
>>>
>>> Please see the comment:
>>>
>>> /*
>>>  * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
>>>  * device related resources are unavailable, but the driver can guarantee
>>>  * that the queue will be rerun in the future once resources become
>>>  * available again. This is typically the case for device specific
>>>  * resources that are consumed for IO. If the driver fails allocating these
>>>  * resources, we know that inflight (or pending) IO will free these
>>>  * resources upon completion.
>>>  *
>>>  * This is different from BLK_STS_RESOURCE in that it explicitly references
>>>  * a device specific resource. For resources of wider scope, allocation
>>>  * failure can happen without having pending IO. This means that we can't
>>>  * rely on request completions freeing these resources, as IO may not be in
>>>  * flight. Examples of that are kernel memory allocations, DMA mappings, or
>>>  * any other system wide resources.
>>>  */
>>> #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
>>>
>>>> In any case, it seems wrong to me that ret is used here at all, as it
>>>> just contains the return value of the last request in the list, and
>>>> whether we rerun the queue should probably not only depend on the last
>>>> request?
>>>>
>>>> Can any of the experts tell me whether this makes sense, or did I get
>>>> something completely wrong?
>>>
>>> Sounds like a bug in the SCSI or ata driver.
>>>
>>> I remember there is a hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
>>> but I never got lucky enough to reproduce it.
>>>
>>> scsi_queue_rq():
>>>     ......
>>>     case BLK_STS_RESOURCE:
>>>         if (atomic_read(&sdev->device_busy) ||
>>>             scsi_device_blocked(sdev))
>>>             ret = BLK_STS_DEV_RESOURCE;
>>>
>> OK, please tell me if I understand this right now.
>> What is supposed to happen is:
>>
>> - request 1 is being processed by the device
>> - request 2 (sync) fails with BLK_STS_DEV_RESOURCE because request 1 is
>>   still being processed
>> - request 2 is put into hctx->dispatch inside blk_mq_dispatch_rq_list (*)
>> - the queue does not get rerun because of BLK_STS_DEV_RESOURCE
>> - request 1 finishes
>> - scsi_end_request calls blk_mq_run_hw_queues
>> - blk_mq_run_hw_queue finds request 2 in hctx->dispatch (**)
>> - runs request 2
>
> Right, it is one typical case if the hw queue depth is 2.
>
>> However, what happens for me is that request 1 finishes earlier, so that
>> (**) is executed before (*) and finds an empty hctx->dispatch list.
>>
>> At least this is consistent with what I see.
>>
>>> All in-flight requests may complete between reading 'sdev->device_busy'
>>> and setting ret to 'BLK_STS_DEV_RESOURCE', and then this IO hang may
>>> be triggered.
>>>
>>
>> More accurate would be: between reading sdev->device_busy and doing
>> list_splice_init(list, &hctx->dispatch) inside blk_mq_dispatch_rq_list.
>
> Exactly. It sounds like it is not so easy to trigger.

blk_mq_dispatch_rq_list
  scsi_queue_rq
    if (atomic_read(&sdev->device_busy) ||
        scsi_device_blocked(sdev))
      ret = BLK_STS_DEV_RESOURCE;
                                        scsi_end_request
                                          __blk_mq_end_request
                                            blk_mq_sched_restart // clear RESTART
                                              blk_mq_run_hw_queue
                                        blk_mq_run_hw_queues
  list_splice_init(list, &hctx->dispatch)
  needs_restart = blk_mq_sched_needs_restart(hctx)

The 'needs_restart' will be false, so the queue would be rerun.

Thanks
Jianchao

>> I'm not sure how the SCSI driver can do anything about this except
>> returning BLK_STS_RESOURCE instead (which I guess would not be a great
>> solution), as basically everything else happens inside blk-mq. Or what
>> do you have in mind?
>
> No better idea yet, :-(
>
> thanks,
> Ming Lei