On 6/3/21 5:13 AM, Stefan Hajnoczi wrote:
> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
>> Results:
>> --------
>> When running with the null_blk driver and vhost-scsi I can get 1.2
>> million IOPs by just running a simple
>>
>> fio --filename=/dev/sda --direct=1 --rw=randrw --bs=4k --ioengine=libaio
>> --iodepth=128 --numjobs=8 --time_based --group_reporting --name=iops
>> --runtime=60 --eta-newline=1
>>
>> The VM has 8 vCPUs and sda has 8 virtqueues and we can do a total of
>> 1024 cmds per devices. To get 1.2 million IOPs I did have to tune and
>> ran the virsh emulatorpin command so the vhost threads were running
>> on different CPUs than the VM. If the vhost threads share CPUs then I
>> get around 800K.
>>
>> For a more real device that are also CPU hogs like iscsi, I can still
>> get 1 million IOPs using 1 dm-multipath device over 8 iscsi paths
>> (natively it gets 1.1 million IOPs).
>
> There is no comparison against a baseline, but I guess it would be the
> same 8 vCPU guest with single queue vhost-scsi?
>

For the iscsi device the max IOPs for the single thread case was around
380K IOPs.

Here are the results with null_blk as the backend device with a 16 vCPU
guest to give you a better picture.

fio numjobs  1      2      4      8      12     16
--------------------------------------------------------

Current upstream (single thread per vhost-scsi device).
After 8 jobs there was no perf diff.
********************************************************
VQs
1            130k   338k   390k   404k   -      -
2            146k   440k   448k   478k   -      -
4            146k   456k   448k   482k   -      -
8            154k   464k   500k   490k   -      -
12           160k   454k   486k   490k   -      -
16           162k   460k   484k   486k   -      -

thread per VQ:
After 16 jobs there was no perf diff even if I increased the number of
guest vCPUs.
*********************************************************
1            same as above
2            166k   320k   542k   664k   558k   658k
4            156k   310k   660k   986k   860k   890k
8            156k   328k   652k   988k   972k   1074k
12           162k   336k   660k   1172k  1190k  1324k
16           162k   332k   664k   1398k  850k   1426k

Note:
- For numjobs > 8, I lowered iodepth so we had a total of 1024 cmds over
  all jobs.
- virtqueue_size/cmd_per_lun=1024 was used for all tests.
- If I modify vhost-scsi so vhost_scsi_handle_vq queues the response
  immediately, so we never enter the LIO/block/scsi layers, then I can
  get around 1.6-1.8M IOPs as the max.
- There are some device wide locks in the LIO main IO path that we are
  hitting in these results. We are working on removing them.
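
To make the first note concrete, here is a rough sketch of how the
per-job iodepth could be derived from the 1024 cmd budget. The helper
name iodepth_for is made up for illustration, rounding down is an
assumption (the exact values used for 12 and 16 jobs are not stated
above), and capping at 128 simply mirrors the --iodepth=128 from the
quoted fio command.

```python
# Hypothetical helper, not part of fio or vhost-scsi.
TOTAL_CMDS = 1024    # virtqueue_size/cmd_per_lun used for all tests
BASE_IODEPTH = 128   # --iodepth from the quoted fio command


def iodepth_for(numjobs, total_cmds=TOTAL_CMDS, base_iodepth=BASE_IODEPTH):
    """Per-job iodepth so that numjobs * iodepth stays within total_cmds.

    Assumption: round down when total_cmds is not divisible by numjobs,
    and never exceed the base iodepth.
    """
    return min(base_iodepth, total_cmds // numjobs)


if __name__ == "__main__":
    for jobs in (1, 2, 4, 8, 12, 16):
        depth = iodepth_for(jobs)
        print(f"numjobs={jobs:2d} iodepth={depth:3d} total={jobs * depth}")
```

With these assumptions 8 jobs keep iodepth 128 (1024 cmds in flight),
while 12 and 16 jobs drop to 85 and 64, which keeps the total at or just
under 1024.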