Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running

Damien Le Moal <dlemoal@xxxxxxxxxx> · Thu, 19 Sep 2024 16:14:15 +0200

On 2024/09/19 14:26, Yu Kuai wrote:
> Hi,
> 
> On 2024/09/11 6:38, Damien Le Moal wrote:
>> On 9/10/24 20:27, Niklas Cassel wrote:
>>> On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>>>>
>>>>
>>>> On 2024/9/10 12:45, Damien Le Moal wrote:
>>>>> On 9/10/24 10:09 AM, yangxingui wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>>>>>> On 9/9/24 22:10, yangxingui wrote:
>>>>>>>> Hello axboe & John,
>>>>>>>>
>>>>>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>>>>>> commands will never be executed while fio is continuously running, such
>>>>>>>> as a smartctl command.
>>>>>>>>
>>>>>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>>>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>>>>>> And the pio command keeps retrying in its corresponding hctx because
>>>>>>>> qc_defer() always returns true.
>>>>>>>>
>>>>>>>> hctx0: ncq, pio, ncq
>>>>>>>> hctx1: ncq, ncq, ...
>>>>>>>> ...
>>>>>>>> hctxn: ncq, ncq, ...
>>>>>>>>
>>>>>>>> Is there any good solution for this?
>>>>>>>
>>>>>>> SATA devices are single queue so how can you have multiple queues ?
>>>>>>> What adapter are you using ?
>>>>>>
>>>>>> In the following patch, we expose the host's 16 hardware queues to the block
>>>>>> layer. And when connecting to a sata disk, 16 hctx are used.
>>>>>>
>>>>>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
>>>>>
>>>>> OK, so the HBA is a hisi one, using libsas...
>>>>> What is the device ? An SSD ? An HDD ?
>>>> Both SATA SSD and SATA HDD have this problem.
>>>>
>>>>>
>>>>> Do you set a block I/O scheduler for the drive, e.g. mq-deadline ? If not, does
>>>>> setting a scheduler resolve the issue ?
>>>> Currently, the default configuration mq-deadline is used, and the same
>>>> phenomenon occurs when I try setting it to none. It seems to have nothing to
>>>> do with the scheduling strategy.
>>>>
>>>>>
>>>>> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
>>>>> have multiple queues with a shared tagset. Never seen the issue you are
>>>>> reporting though using HDDs with mq-deadline or bfq as the scheduler.
>>>> Unlike libsas, as these hosts don't use qc_defer()?
>>>
>>> mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
>>> Translation (SAT) is done completely by the HBA, so from a Linux
>>> perspective, we are issuing SCSI commands to the HBA.
>>
>> Yes, but we can still get requeues happening. Though for a SATA drive, that is
>> unlikely since the max queue depth is clearly defined, unlike for SAS drives.
>>
>>> We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
>>> https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566
>>
>> And that may be the issue. More on this below.
>>
>>> Without considering if it is a good idea or not, it should be possible to
>>> translate some commands to instead use the "NCQ encapsulated" variant of
>>> the ATA command that was used in the "ATA-16 passthrough" SCSI command.
>>
>> That would be way too much work on the user side, and likely open up a can of
>> device bugs unseen until now.
>>
>>> To be able to send a non-queued command, there must be no NCQ commands queued
>>> on the device. I guess you could implement a scheduler that would quiesce
>>> the queue, process the non-queued command, and then thaw the queue, but that
>>> would essentially make non-queued commands high priority commands, and could
>>> thus be used to seriously limit throughput by just sending some non-queued
>>> commands every now and then :)
>>
>> Passthrough commands do not go through the scheduler and are submitted directly
>> to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).
>>
>> So for a single queue device, even if ata_qc_defer causes a requeue, the
>> passthrough command ends up back at the top of the dispatch queue. After
>> repeating this a few times, all in-flight NCQ commands complete and the
>> passthrough command goes through.
>>
>> But I feel this is very fragile given that the block layer requeue is done
>> through a work item, so in parallel to an application submitting IOs. So in
>> theory, I think that the requeue for the passthrough command could happen forever...
>>
>> And for a multi-queue setup like with the hisi adapter, that is what is happening.
>>
>> I do not have any good idea how to fix that yet. We need to find something.
>> scsi_queue_rq() and the budget/host or device blocked state management may help
>> with that, or we have a bug there... In any case, I do not think it is a block
>> layer issue as the block layer knows nothing about NCQ vs non-NCQ.
> 
> Does libata return a specific value in this case? If so, maybe we can
> stop other hctx until this IO is handled.
> 
> For now, I think libata should use a single hctx, it just doesn't support
> multiple hctx yet.

libata does not care/know about hctx. It only issues commands to ATA devices,
which are always single queue. And pure SATA adapters like AHCI are always
single queue.

The issue at hand can happen only for libsas based SAS HBAs that have multiple
command submission queues (with a shared tag set). Commands for the same device
may end up being submitted through different queues. When the submitted
commands include a mix of NCQ and non-NCQ commands, the problem happens and
libata cannot easily do anything about it. No control is possible at the scsi
layer either, since the commands submitted there are SCSI commands (not yet
translated to ATA commands), which carry no NCQ/non-NCQ exclusion knowledge at
all. NCQ is an ATA concept unknown to the scsi and block layers.
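
To make the starvation concrete, here is a toy model in Python. It is not
kernel code and only a sketch under stated assumptions: the function name
simulate(), the round-based timing and the fixed service time are all
hypothetical, and the exclusion rule (a non-NCQ command is issued only when
nothing is in flight) is a simplified stand-in for the qc_defer() check, not a
copy of it. A deferred command is assumed to stay at the head of its own hctx,
blocking only that hctx.

```python
from collections import deque

QUEUE_DEPTH = 32  # NCQ allows at most 32 outstanding commands per device


def simulate(num_hctx, rounds=1000, service_time=3):
    """Return the round in which the non-NCQ command is issued, or None."""
    queues = [deque() for _ in range(num_hctx)]
    queues[0].append("pio")  # the smartctl-like command waits in hctx0
    # Remaining service time of each NCQ command already on the device.
    inflight = list(range(1, service_time + 1))

    for rnd in range(1, rounds + 1):
        # Completions: every in-flight command makes one step of progress.
        inflight = [t - 1 for t in inflight if t > 1]

        for q in queues:
            if not q:
                q.append("ncq")  # fio keeps every hctx supplied with I/O
            if q[0] == "pio":
                if not inflight:
                    return rnd  # device idle: the non-NCQ command can go
                continue  # deferred: stays at the head, blocks this hctx only
            if len(inflight) < QUEUE_DEPTH:
                q.popleft()
                inflight.append(service_time)
    return None
```

With these defaults, simulate(1) returns 3: the deferred command blocks the only
queue, the in-flight NCQ commands drain, and the non-NCQ command is issued.
simulate(16) returns None: the other 15 hctx keep feeding NCQ commands to the
device, so it never becomes idle. The model ignores the requeue work item race,
so it makes the single queue case look safer than it really is.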

We (Niklas and I) are trying to find a solution, but that may not be within
libata itself. It may need changes to libsas as well. Not sure yet. Still exploring.

> 
> Thanks,
> Kuai
> 
>>
> 

-- 
Damien Le Moal
Western Digital Research