On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>
>
> On 2024/9/10 12:45, Damien Le Moal wrote:
> > On 9/10/24 10:09 AM, yangxingui wrote:
> > >
> > >
> > > On 2024/9/9 21:21, Damien Le Moal wrote:
> > > > On 9/9/24 22:10, yangxingui wrote:
> > > > > Hello axboe & John,
> > > > >
> > > > > After the driver exposes all HW queues to the block layer, non-NCQ
> > > > > commands will never be executed while fio is continuously running, such
> > > > > as a smartctl command.
> > > > >
> > > > > The cause of the problem is that other hctx used by the NCQ command is
> > > > > still active and can continue to issue NCQ commands to the sata disk.
> > > > > And the pio command keeps retrying in its corresponding hctx because
> > > > > qc_defer() always returns true.
> > > > >
> > > > > hctx0: ncq, pio, ncq
> > > > > hctx1:ncq, ncq, ...
> > > > > ...
> > > > > hctxn: ncq, ncq, ...
> > > > >
> > > > > Is there any good solution for this?
> > > >
> > > > SATA devices are single queue so how can you have multiple queues ?
> > > > What adapter are you using ?
> > >
> > > In the following patch, we expose the host's 16 hardware queues to the block
> > > layer. And when connecting to a sata disk, 16 hctx are used.
> > >
> > > 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
> >
> > OK, so the HBA is a hisi one, using libsas...
> > What is the device ? An SSD ? and HDD ?
> Both SATA SSD and SATA HDD have this problem.
>
> >
> > Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
> > setting a scheduler resolve the issue ?
> Currently, the default configuration mq-deadline is used, and the same
> phenomenon occurs when I try setting it to none. It seems to have nothing to
> do with the scheduling strategy.
>
> >
> > I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
> > have multiple queues with a shared tagset. Never seen the issue you are
> > reporting though using HDDs with mq-deadline or bfq as the scheduler.
> Unlike libsas, as these hosts don't use qc_defer()?

mpt3sas and mpi3mr do not use any libata code at all; the SCSI to ATA
Translation (SAT) is done completely by the HBA, so from a Linux perspective,
we are issuing SCSI commands to the HBA.

We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566

If you look at the SATA 3.a Gold specification,
"13.6.3 Intermixing Non-NCQ commands and NCQ commands"

"The host shall not issue a non-NCQ command while an NCQ command is
outstanding."

In the AHCI 1.3.1 specification,
"1.7 Theory of Operation"

"System software is responsible to ensure that queued and non-queued commands
are not mixed in the command list for the same device with the exception of
the NCQ Unload command."

Usually, tools like smartctl submit SCSI commands of type "ATA-16 passthrough",
which is a specific SCSI command that just contains a regular ATA command as
payload:
https://www.smartmontools.org/browser/trunk/smartmontools/scsiata.cpp?desc=1&order=date#L346

For an "ATA-16 passthrough" SCSI command, libata will simply copy the fields
from the "ATA-16 passthrough" SCSI command to the appropriate fields in a newly
created ATA command, see the SAT specification and:
https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/ata/libata-scsi.c#L2878-L2887

See also the SAT-6 specification,
"6.2.4 Mechanism for processing some commands as NCQ commands"

"The ACS-5 standard defines a mechanism for NCQ encapsulation of some commands.
Use of this mechanism allows these commands to be processed without quiescing
the ATA device."

Without considering if it is a good idea or not, it should be possible to
translate some commands to instead use the "NCQ encapsulated" variant of the
ATA command that was used in the "ATA-16 passthrough" SCSI command.

However, looking at e.g.:
https://www.smartmontools.org/browser/trunk/smartmontools/scsiata.cpp?desc=1&order=date#L566

smartctl is sending an IDENTIFY DEVICE (ECh) ATA command, and this command has
no NCQ encapsulated variant.

(Had the application instead used a READ LOG DMA EXT command to read the
IDENTIFY DEVICE data log, where log page 01h is a copy of IDENTIFY DEVICE
data, we would have been able to convert the command to an NCQ encapsulated
variant.)


TL;DR:
I do not see an easy generic solution to this problem.

To be able to send a non-queued command, there must be no NCQ commands queued
on the device.

I guess you could implement a scheduler that would quiesce the queue, process
the non-queued command, and then thaw the queue, but that would essentially
make non-queued commands high priority commands, and could thus be used to
seriously limit throughput by just sending some non-queued commands every now
and then :)


Kind regards,
Niklas
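
P.S. To make the starvation described in this thread concrete, here is a
minimal toy model of the deferral rule in Python. It is only a sketch under
stated assumptions: the names Link, must_defer() and simulate() are
hypothetical, the model is not the kernel's ata_std_qc_defer()
implementation, and it assumes one NCQ completion per round with the other
hardware queues resubmitting immediately.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Toy model of one SATA link (hypothetical, not struct ata_link)."""
    ncq_outstanding: int = 0      # NCQ commands currently in flight
    non_ncq_active: bool = False  # a non-queued command is in flight


def must_defer(link: Link, is_ncq: bool) -> bool:
    """Apply the 'do not mix queued and non-queued commands' rule."""
    if is_ncq:
        # An NCQ command only has to wait for an active non-NCQ command.
        return link.non_ncq_active
    # A non-NCQ command has to wait until nothing at all is in flight.
    return link.non_ncq_active or link.ncq_outstanding > 0


def simulate(rounds: int, refill: bool, initial_ncq: int = 4) -> int:
    """Return the round in which the non-NCQ command is issued, or -1."""
    link = Link(ncq_outstanding=initial_ncq)
    for rnd in range(rounds):
        # The non-NCQ (e.g. PIO) command is retried at the start of a round.
        if not must_defer(link, is_ncq=False):
            link.non_ncq_active = True
            return rnd
        # One NCQ command completes.
        if link.ncq_outstanding > 0:
            link.ncq_outstanding -= 1
        # Assumption: other hardware queues immediately submit a new NCQ
        # command, which is allowed because no non-NCQ command is active.
        if refill and not must_defer(link, is_ncq=True):
            link.ncq_outstanding += 1
    return -1
```

With refill=True (other hctx keep submitting NCQ commands), simulate(1000,
refill=True) returns -1, i.e. the non-NCQ command is never issued. With
refill=False the link drains and simulate(1000, refill=False) returns 4.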