Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Thu, 19 Sep 2024 20:26:52 +0800

Hi,

On 2024/09/11 6:38, Damien Le Moal wrote:
On 9/10/24 20:27, Niklas Cassel wrote:
On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:

On 2024/9/10 12:45, Damien Le Moal wrote:
On 9/10/24 10:09 AM, yangxingui wrote:

On 2024/9/9 21:21, Damien Le Moal wrote:
On 9/9/24 22:10, yangxingui wrote:
Hello axboe & John,

After the driver exposes all HW queues to the block layer, non-NCQ
commands (such as those issued by smartctl) will never be executed
while fio is continuously running.

The cause of the problem is that the other hctxs used by NCQ commands are
still active and can continue to issue NCQ commands to the SATA disk,
while the PIO command keeps retrying in its own hctx because
qc_defer() always returns true.

hctx0: ncq, pio, ncq
hctx1: ncq, ncq, ...
...
hctxn: ncq, ncq, ...
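
The deferral rule behind this behavior can be sketched roughly as follows. This is a simplified model, not the kernel's code: struct link_model and must_defer() are hypothetical names, and the rule encoded is the one stated in this thread (a non-NCQ command can only be sent when no NCQ commands are queued on the device).

```c
#include <stdbool.h>

/* Hypothetical, simplified link state; not the kernel's struct ata_link. */
struct link_model {
	bool non_ncq_active;       /* a non-NCQ command is currently executing */
	unsigned int ncq_inflight; /* number of NCQ commands queued on the device */
};

/* Returns true when the command must be deferred (requeued). */
bool must_defer(const struct link_model *link, bool cmd_is_ncq)
{
	if (cmd_is_ncq)
		/* An NCQ command only has to wait for a running non-NCQ command. */
		return link->non_ncq_active;

	/* A non-NCQ command needs the device to be completely idle. */
	return link->non_ncq_active || link->ncq_inflight > 0;
}
```

Under this model, as long as any hctx keeps ncq_inflight above zero, a PIO command is deferred on every attempt, while NCQ commands from the other hctxs are never deferred.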

Is there any good solution for this?

SATA devices are single queue so how can you have multiple queues ?
What adapter are you using ?

In the following patch, we expose the host's 16 hardware queues to the block
layer. And when connecting to a sata disk, 16 hctx are used.

8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")

OK, so the HBA is a hisi one, using libsas...
What is the device ? An SSD ? An HDD ?
Both SATA SSD and SATA HDD have this problem.

Do you set a block I/O scheduler for the drive, e.g. mq-deadline ? If not, does
setting a scheduler resolve the issue ?
Currently, the default scheduler, mq-deadline, is used, and the same
phenomenon occurs when I try setting it to none. It seems to have nothing to
do with the scheduling strategy.

I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
have multiple queues with a shared tagset. Never seen the issue you are
reporting though using HDDs with mq-deadline or bfq as the scheduler.
Is that because, unlike libsas, these hosts don't use qc_defer()?

mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
Translation (SAT) is done completely by the HBA, so from a Linux
perspective, we are issuing SCSI commands to the HBA.

Yes, but requeues can still happen. Though for a SATA drive, that is
unlikely since the max queue depth is clearly defined, unlike for SAS drives.

We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566

And that may be the issue. More on this below.

Without considering if it is a good idea or not, it should be possible to
translate some commands to instead use the "NCQ encapsulated" variant of
the ATA command that was used in the "ATA-16 passthrough" SCSI command.

That would be way too much work on the user side, and likely open up a can of
device bugs unseen until now.

To be able to send a non-queued command, there must be no NCQ commands queued
on the device. I guess you could implement a scheduler that quiesces
the queue, processes the non-queued command, and then thaws the queue, but that
would essentially make non-queued commands high priority commands, and could
thus be used to seriously limit throughput by just sending some non-queued
commands every now and then :)

Passthrough commands do not go through the scheduler and are submitted directly
to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).

So for a single queue device, even if ata_qc_defer causes a requeue, the
passthrough command ends up back at the top of the dispatch queue. After
repeating this a few times, all in-flight NCQ commands complete and the
passthrough command goes through.

But I feel this is very fragile given that the block layer requeue is done
through a work item, so in parallel to an application submitting IOs. So in
theory, I think that the requeue for the passthrough command could happen forever...

And for a multi-queue setup like with the hisi adapter, that is what is happening.
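
The difference between the two cases can be illustrated with a toy tick-based model. It is only a sketch under strong assumptions: the function name, SERVICE_TICKS, the one-command-per-hctx limit and the staggered start are all hypothetical, and real requeue timing is far less regular.

```c
#define MAX_HCTX 16
#define SERVICE_TICKS 2 /* hypothetical: ticks an NCQ command stays in flight */

/*
 * Toy model (all names hypothetical): returns the tick at which the
 * non-NCQ command pending at the head of hctx 0 is issued, or -1 if it
 * is still being deferred after max_ticks.
 */
int ticks_until_non_ncq_issued(int nr_hctx, int max_ticks)
{
	int remaining[MAX_HCTX] = {0}; /* ticks left per hctx, 0 = idle */

	if (nr_hctx < 1 || nr_hctx > MAX_HCTX)
		return -1;

	/* Start with one NCQ command in flight per hctx, staggered. */
	for (int i = 0; i < nr_hctx; i++)
		remaining[i] = 1 + i % SERVICE_TICKS;

	for (int tick = 1; tick <= max_ticks; tick++) {
		int inflight = 0;

		/* Device side: every in-flight NCQ command makes progress. */
		for (int i = 0; i < nr_hctx; i++) {
			if (remaining[i] > 0)
				remaining[i]--;
			if (remaining[i] > 0)
				inflight++;
		}

		/* The non-NCQ command is issued only if the device is idle. */
		if (inflight == 0)
			return tick;

		/*
		 * Deferred: hctx 0 stays blocked behind its head request,
		 * but every other hctx refills its idle slot with new NCQ I/O.
		 */
		for (int i = 1; i < nr_hctx; i++)
			if (remaining[i] == 0)
				remaining[i] = SERVICE_TICKS;
	}
	return -1;
}
```

In this model, a single hctx lets the non-NCQ command through as soon as the in-flight NCQ command drains, because nothing behind the deferred head is dispatched. With three or more hctxs whose completions are staggered, the device never becomes idle at the moment the command is retried, so the function returns -1 for any max_ticks.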

I do not have any good idea how to fix that yet. We need to find something.
scsi_queue_rq() and the budget/host or device blocked state management may help
with that, or we have a bug there... In any case, I do not think it is a block
layer issue as the block layer knows nothing about NCQ vs non-NCQ.

Does libata return a specific value in this case? If so, maybe we can
stop the other hctxs until this IO is handled.

For now, I think libata should use a single hctx, since it doesn't support
multiple hctxs yet.

Thanks,
Kuai