On 2020/03/10 14:54, Shinichiro Kawasaki wrote:
[...]
>>> Yeah, I got the log, and it has been put in the following link:
>>>
>>> http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
>>>
>>> Also I have run some analysis, and block/004 basically calls pwrite() &
>>> fsync() in each job.
>>>
>>> 1) v5.6-rc kernel
>>> - write rq average latency: 0.091s
>>> - flush rq average latency: 0.018s
>>> - average issue times of write request: 1.978 //how many trace_block_rq_issue() is called for one rq
>>> - average issue times of flush request: 1.052
>>>
>>> 2) v5.6-rc patched kernel
>>> - write rq average latency: 0.031s
>>> - flush rq average latency: 0.035s
>>> - average issue times of write request: 1.466
>>> - average issue times of flush request: 1.610
>>>
>>>
>>> block/004 starts 64 jobs and AHCI's queue can become saturated easily,
>>> then BLK_MQ_S_SCHED_RESTART should be set, so a write request in the
>>> dispatch queue can only move on after one request is completed.
>>>
>>> It looks like the flush request takes a shorter time than a normal write IO.
>>> If the flush request is added to the front of the dispatch queue, the next
>>> normal write IO can be queued to the LLD quicker than if it were added to
>>> the tail of the dispatch queue.
>>> trace_block_rq_issue() being called more than once is because of
>>> AHCI or the drive's implementation. It usually means that
>>> host->hostt->queuecommand() fails several times when queuing one
>>> single request. For AHCI, I understand it shouldn't fail normally given
>>> we guarantee that the actual queue depth is <= either the LUN's or the
>>> host's queue depth. Maybe it depends on your SMR drive's implementation of
>>> flush/write IO handling. Could you check why .queuecommand() fails in block/004?
>
> I put some debug prints and confirmed that the .queuecommand function is
> ata_scsi_queuecmd() and it returns SCSI_MLQUEUE_DEVICE_BUSY because
> ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> notes that "Non-NCQ commands cannot run with any other command, NCQ or not. As
> upper layer only knows the queue depth, we are responsible for maintaining
> exclusion. This function checks whether a new command @qc can be issued." So
> I guess .queuecommand() fails because the non-NCQ flush command and the NCQ
> write commands are waiting for each other's completion.
>
>>
>> Indeed, it is weird that queuecommand fails. There is nothing SMR specific in
>> the AHCI code beside disk probe checks. So write & flush handling does not
>> differ between SMR and regular disks. The same applies to the drive side: write
>> and flush commands are normal commands, no change at all. The only
>> difference is the sequential write constraint, which the drive honors by not
>> executing the queued write commands out of order. But there is no constraint
>> for flush on SMR, so we get whatever the drive does, that is, totally drive
>> dependent.
>>
>> I wonder if the issue may be with the particular AHCI chipset used for this test.
>>
>>>
>>> Also can you provide the queue flags via the following command?
>>>
>>> cat /sys/kernel/debug/block/sdN/state
>
> The state sysfs attribute was as follows:
>
> SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
>
> It didn't change before and after the block/004 run.
>
>
>>>
>>> I understand a flush request should be slower than a normal write, however
>>> that looks not true on this hardware.
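
As a side note on the workload under discussion: block/004 is described above
as having each of its 64 jobs issue a pwrite() followed by an fsync(), so every
data write is paired with a cache flush. Below is a minimal standalone sketch
of that per-job pattern; the file path, I/O size and loop count are
placeholders, not the actual blktests/fio parameters.

/*
 * Minimal sketch of the per-job I/O pattern block/004 exercises:
 * a sequential pwrite() immediately followed by fsync().  The file
 * path, I/O size and loop count are placeholders, not the real
 * blktests/fio settings.  Each fsync() results in a flush request
 * being sent to the device.
 */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = { 0 };
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (int i = 0; i < 256; i++) {
		/* Sequential write... */
		if (pwrite(fd, buf, sizeof(buf), (off_t)i * sizeof(buf)) < 0) {
			perror("pwrite");
			break;
		}
		/* ...then force it (and the drive write cache) to media. */
		if (fsync(fd) < 0) {
			perror("fsync");
			break;
		}
	}

	close(fd);
	return 0;
}
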
>>
>> Probably due to the fact that since the writes are sequential, there is a lot of
>> drive internal optimization that can be done to minimize the cost of flush
>> (internal data streaming from cache to media, media-cache use, etc.). That is
>> true for regular disks too. And all of this is highly dependent on the hardware
>> implementation.
>
> This discussion tempted me to take a closer look at the traces. And I noticed
> that the number of flush commands issued is different with and without the patch.
>
>                         | without patch | with patch
> ------------------------+---------------+------------
> block_getrq: rwbs=FWS   |         32640 |      32640
> block_rq_issue: rwbs=FF |         32101 |       7593
>
> Without the patch, a flush command is issued between two write commands. With
> the patch, some write commands are executed without a flush between them.
>
> I wonder how the requeue list position of the flush command (head vs tail)
> changes the number of flush commands issued.
>
> Another weird thing is the number of block_getrq traces of flush (rwbs=FWS).
> It is double the number of writes (256 * 64 = 16320). I will chase this further.
>
>
>> Shinichiro,
>>
>> It may be worth trying the same run with & without Ming's patch on a machine
>> with a different chipset...
>
> Thanks for the advice. So far, I have observed the long block/004 runtime on
> two systems. One with an Intel C220 Series SATA controller, and the other with
> an Intel 200 Series PCH SATA controller. I will try to find another SATA
> controller.

I do not think it is necessary. I forgot that flush cache is a non-NCQ command.
That is why queuecommand fails as you found out: it is waiting for all NCQ
commands to complete first. Nothing to do with the chipset type. This is per
the ATA specs.

-- 
Damien Le Moal
Western Digital Research
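
As a footnote to the above: the deferral behavior observed through
ata_std_qc_defer() / ATA_DEFER_LINK can be modeled roughly as below. This is a
simplified sketch, not the actual libata code; struct link_state and qc_defer()
are made-up stand-ins. It only illustrates that an NCQ write is deferred while a
non-NCQ command owns the link, and a non-NCQ flush is deferred until the link is
completely idle, which is why the two appear to wait for each other.

/*
 * Simplified model of the NCQ/non-NCQ exclusion that ata_std_qc_defer()
 * enforces -- not the actual libata code.  struct link_state and
 * qc_defer() are stand-ins for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define DEFER 1		/* corresponds to ATA_DEFER_LINK in the kernel */

struct link_state {
	bool non_ncq_active;		/* a non-NCQ command holds the link */
	unsigned int ncq_outstanding;	/* in-flight NCQ commands */
};

static int qc_defer(const struct link_state *link, bool cmd_is_ncq)
{
	if (cmd_is_ncq) {
		/* NCQ command: allowed only if no non-NCQ command is active. */
		if (!link->non_ncq_active)
			return 0;
	} else {
		/* Non-NCQ command (e.g. flush): needs a fully idle link. */
		if (!link->non_ncq_active && link->ncq_outstanding == 0)
			return 0;
	}
	/* The SCSI midlayer sees this as SCSI_MLQUEUE_DEVICE_BUSY. */
	return DEFER;
}

int main(void)
{
	struct link_state link = { .non_ncq_active = false, .ncq_outstanding = 3 };

	/* A flush arriving while NCQ writes are in flight is deferred... */
	printf("flush, 3 NCQ writes outstanding: %s\n",
	       qc_defer(&link, false) ? "deferred" : "issued");

	/* ...and while the flush runs, further NCQ writes are deferred too. */
	link.non_ncq_active = true;
	link.ncq_outstanding = 0;
	printf("NCQ write, flush active:         %s\n",
	       qc_defer(&link, true) ? "deferred" : "issued");

	return 0;
}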