On Wed, Dec 25, 2019 at 10:30:57PM -0500, Theodore Y. Ts'o wrote:
> On Thu, Dec 26, 2019 at 10:27:02AM +0800, Ming Lei wrote:
> > Maybe we need to be careful for HDD, since the request count in scheduler
> > queue is double of in-flight request count, and in theory NCQ should only
> > cover all in-flight 32 requests. I will find a SATA HDD and see if
> > performance drop can be observed in the similar 'cp' test.
>
> Please try to measure it, but I'd be really surprised if it's
> significant with modern HDD's.

I just found a machine with AHCI SATA and ran the following xfs
overwrite test:

#!/bin/bash
DIR=$1
echo 3 > /proc/sys/vm/drop_caches
fio --readwrite=write --filesize=5g --overwrite=1 --filename=$DIR/fiofile \
    --runtime=60s --time_based --ioengine=psync --direct=0 --bs=4k \
    --iodepth=128 --numjobs=2 --group_reporting=1 --name=overwrite

The FS is xfs, and the disk is LVM over AHCI SATA with NCQ (depth 32),
because the machine was picked from RH beaker and this is the only disk
in the box.

#lsblk
NAME                            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                               8:0    0 931.5G  0 disk
├─sda1                            8:1    0     1G  0 part /boot
└─sda2                            8:2    0 930.5G  0 part
  ├─rhel_hpe--ml10gen9--01-root 253:0    0    50G  0 lvm  /
  ├─rhel_hpe--ml10gen9--01-swap 253:1    0   3.9G  0 lvm  [SWAP]
  └─rhel_hpe--ml10gen9--01-home 253:2    0 876.6G  0 lvm  /home

kernel: 3a7ea2c483a53fc ("scsi: provide mq_ops->busy() hook"), which is
the commit just before f664a3cc17b7 ("scsi: kill off the legacy IO path").

            |scsi_mod.use_blk_mq=N |scsi_mod.use_blk_mq=Y |
-----------------------------------------------------------
throughput: |244MB/s               |169MB/s               |
-----------------------------------------------------------

A similar result can be observed on the v5.4 kernel (184MB/s) with the
same test steps.

> That's because they typically have
> a queue depth of 16, and a max_sectors_kb of 32767 (i.e., just under
> 32 MiB). Short seeks are typically 1-2 ms, with full stroke seeks
> 8-10ms. Typical sequential write speeds on a 7200 RPM drive are
> 125-150 MiB/s. So suppose every other request sent to the HDD is from
> the other request stream. The disk will choose the 8 requests from its
> queue that are contiguous, and so it will be writing around 256 MiB,
> which will take 2-3 seconds. If it then needs to spend between 1 and
> 10 ms seeking to another location of the disk, before it writes the
> next 256 MiB, the worst case overhead of that seek is 10ms / 2s, or
> 0.5%. That may very well be within your measurements' error bars.

It looks like you assume that the disk seeks only once while writing
around 256MB. That assumption may not hold, given that all the data can
be in the page cache before it is written. So when two tasks submit IOs
concurrently, the IOs from each task are sequential, and NCQ may reorder
the current batch submitted from the two streams. However, a disk seek
may still be needed for the next batch handled by NCQ.

> And of course, note that in real life, we are very *often* writing to
> multiple files in parallel, for example, during a "make -j16" while
> building the kernel. Writing a single large file is certainly
> something people do (but even there people who are burning a 4G DVD
> rip are often browsing the web while they are waiting for it to
> complete, and the browser will be writing cache files, etc.). So
> whether or not this is something where we should be stressing over
> this specific workload is going to be quite debatable.

Thanks,
Ming
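
The disagreement above about seek cost can be made concrete with a little arithmetic. The sketch below is not from the thread: seek_overhead_pct is a hypothetical helper, and the 128 MiB/s write rate and the 1 MiB chunk size are assumed values picked only for illustration (the thread does not say how much data the disk actually writes between seeks). It prints the share of time spent seeking when one seek is paid per chunk of contiguous data.

```shell
#!/bin/bash
# Hypothetical helper: percentage of time spent seeking when the disk
# pays one seek per contiguous chunk of data written.
#   $1 = seek time in ms
#   $2 = contiguous data written between seeks, in MiB (assumed value)
#   $3 = sequential write rate in MiB/s (assumed value)
seek_overhead_pct() {
    awk -v seek_ms="$1" -v chunk_mib="$2" -v rate="$3" 'BEGIN {
        transfer_ms = chunk_mib / rate * 1000
        printf "%.1f\n", 100 * seek_ms / (seek_ms + transfer_ms)
    }'
}

# One 10 ms seek per 256 MiB, as in the quoted estimate.
seek_overhead_pct 10 256 128   # prints 0.5

# One 10 ms seek per 1 MiB (assumed chunk size, for contrast only).
seek_overhead_pct 10 1 128     # prints 56.1
```

The first call reproduces the roughly 0.5% figure from the quoted estimate. The second shows how quickly the overhead grows if a seek is needed after much less data, which is the case the reply argues may apply when each NCQ batch needs its own seek.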