On 15/04/2021 13:18, Ming Lei wrote:
On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote:
Hi Ming,
I'll have a look.
BTW, are you intentionally using scsi_debug over null_blk? null_blk supports
shared sbitmap as well, and performance figures there are generally higher
than scsi_debug for similar fio settings.
I use both, but scsi_debug can cover scsi stack test.
Hi Ming,
I can't seem to recreate your issue. Are you using a mainline defconfig, or
a special distro config?
What I am seeing is that scsi_debug throughput is fixed at ~32K IOPS
with both modprobe configs and with both the none and mq-deadline
IO schedulers. CPU utilization seems a bit higher for hosttags with none.
When I tried null_blk, the performance difference between hosttags and
non-hosttags was noticeable with the none IO scheduler, but not with
mq-deadline:
1) randread test with deadline
               | IOPS | FIO CPU util
------------------------------------------------
hosttags*      | 325K | usr=1.34%, sys=76.49%
------------------------------------------------
non hosttags** | 325K | usr=1.36%, sys=76.25%
------------------------------------------------
2) randread test with none
               | IOPS  | FIO CPU util
------------------------------------------------
hosttags*      | 6421K | usr=23.84%, sys=76.06%
------------------------------------------------
non hosttags** | 6893K | usr=25.57%, sys=74.33%
------------------------------------------------
* insmod null_blk.ko submit_queues=32 shared_tag_bitmap=1
** insmod null_blk.ko submit_queues=32
However, I don't think the null_blk test is a good like-for-like
comparison: setting shared_tag_bitmap means the same tag set is shared
across all hctx, but the hctx count itself is unchanged.
Just setting submit_queues=1 gives a big drop in performance, as would
be expected.
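To make that distinction concrete, here is a back-of-the-envelope sketch of the tag-space arithmetic (the 32/128 figures are illustrative, chosen to mirror the submit_queues=32 modprobe lines above; this is an illustration, not null_blk internals):

```shell
# Back-of-the-envelope tag arithmetic, not null_blk internals.
# Hypothetical figures: 32 hctx (submit_queues=32) and a 128-tag set.
NR_HW_QUEUES=32
TAGS_PER_SET=128

# Without shared_tag_bitmap, every hctx owns a private 128-tag set:
total_private=$((NR_HW_QUEUES * TAGS_PER_SET))

# With shared_tag_bitmap=1, the hctx count is unchanged, but all 32
# hctx allocate from a single shared 128-tag set:
total_shared=$TAGS_PER_SET

echo "private tag space: $total_private tags"
echo "shared tag space:  $total_shared tags"
```

So the shared-bitmap case keeps the same submission-side parallelism while shrinking the tag space the hctx contend for, which is why it isn't a like-for-like stand-in for a single-queue setup.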
Thanks,
John
            | IOPS | FIO CPU util
------------------------------------------------
mq-deadline | 423K | usr=21.72%, sys=44.18%
------------------------------------------------
none        | 450K | usr=23.15%, sys=74.01%
------------------------------------------------
Today I re-ran the scsi_debug test on two server machines (32 cores, dual
NUMA nodes), and the CPU utilization issue can be reproduced. The test
results follow:
1) randread test on ibm-x3850x6[*] with deadline
             | IOPS | FIO CPU util
------------------------------------------------
hosttags     | 94K  | usr=1.13%, sys=14.75%
------------------------------------------------
non hosttags | 124K | usr=1.12%, sys=10.65%
------------------------------------------------
2) randread test on ibm-x3850x6[*] with none
             | IOPS | FIO CPU util
------------------------------------------------
hosttags     | 120K | usr=0.89%, sys=6.55%
------------------------------------------------
non hosttags | 121K | usr=1.07%, sys=7.35%
------------------------------------------------
*:
- that is the machine on which Yanhui reported the 20% increase in VM CPU
utilization
- kernel: latest linus tree (v5.12-rc7, commit: 7f75285ca57)
- the same test was also run on another 32-core machine; the IOPS drop isn't
observed there, but CPU utilization is clearly increased
3) test script
#!/bin/bash
run_fio() {
    RTIME=$1
    JOBS=$2
    DEVS=$3
    BS=$4
    QD=64
    BATCH=16

    fio --bs=$BS --ioengine=libaio \
        --iodepth=$QD \
        --iodepth_batch_submit=$BATCH \
        --iodepth_batch_complete_min=$BATCH \
        --filename=$DEVS \
        --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
        --name=test --group_reporting
}
SCHED=$1
NRQS=`getconf _NPROCESSORS_ONLN`
rmmod scsi_debug
modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler
echo 128 >/sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8k
rmmod scsi_debug
modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler
echo 128 >/sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8k
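A note on the device-discovery step in the script: it relies on lsscsi printing the block device node as the sixth whitespace-separated column. A minimal sketch against a canned lsscsi-style line (the sample line below is hypothetical, not captured output):

```shell
# Hypothetical lsscsi output line for a scsi_debug LUN; field 6 is the
# block device node, which the script passes to fio and to basename
# when writing the scheduler and queue_depth sysfs attributes.
line='[0:0:0:0]    disk    Linux    scsi_debug    0191    /dev/sdb'
DEV=$(echo "$line" | awk '{print $6}')
echo "$DEV"              # full node, e.g. /dev/sdb
echo "$(basename $DEV)"  # name used under /sys/block, e.g. sdb
```

If the test box has other disks whose vendor/model strings happen to contain "scsi_debug", the grep could match more than one line, so the script assumes exactly one scsi_debug device is present.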