On 15/04/2021 13:18, Ming Lei wrote:
On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote:
Hi Ming,
I'll have a look.
BTW, are you intentionally using scsi_debug over null_blk? null_blk supports
shared sbitmap as well, and performance figures there are generally higher
than scsi_debug for similar fio settings.
I use both, but scsi_debug can cover scsi stack test.
Hi Ming,
I can't seem to recreate your issue. Are you using a mainline defconfig, or
a special distro config?
What I am seeing is that scsi_debug throughput is fixed at ~32K IOPS
with both modprobe configs and with both the none and mq-deadline
IO schedulers. CPU utilization seems a bit higher for hosttags with none.
When I tried null_blk, the performance difference between hosttags and
non-hosttags was noticeable with the none IO scheduler, but not with
mq-deadline:
1) randread test with deadline
               | IOPS | FIO CPU util
------------------------------------------------
hosttags*      | 325K | usr=1.34%, sys=76.49%
------------------------------------------------
non hosttags** | 325K | usr=1.36%, sys=76.25%
------------------------------------------------
2) randread test with none
               | IOPS  | FIO CPU util
------------------------------------------------
hosttags*      | 6421K | usr=23.84%, sys=76.06%
------------------------------------------------
non hosttags** | 6893K | usr=25.57%, sys=74.33%
------------------------------------------------
* insmod null_blk.ko submit_queues=32 shared_tag_bitmap=1
** insmod null_blk.ko submit_queues=32
However, I don't think the null_blk test is a good like-for-like
comparison: setting shared_tag_bitmap means the same tag set is shared
across all hctx, but the hctx count itself is unchanged.
Just setting submit_queues=1 gives a big drop in performance, as would
be expected.
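To make that distinction concrete, here is a back-of-the-envelope sketch of the tag-space arithmetic (the 32/128 figures are illustrative, chosen to mirror the submit_queues=32 modprobe lines above; this is an illustration, not null_blk internals):

```shell
# Back-of-the-envelope tag arithmetic, not null_blk internals.
# Hypothetical figures: 32 hctx (submit_queues=32) and a 128-tag set.
NR_HW_QUEUES=32
TAGS_PER_SET=128

# Without shared_tag_bitmap, every hctx owns a private 128-tag set:
total_private=$((NR_HW_QUEUES * TAGS_PER_SET))

# With shared_tag_bitmap=1, the hctx count is unchanged, but all 32
# hctx allocate from a single shared 128-tag set:
total_shared=$TAGS_PER_SET

echo "private tag space: $total_private tags"
echo "shared tag space:  $total_shared tags"
```

So the shared-bitmap case keeps the same submission-side parallelism while shrinking the tag space the hctx contend for, which is why it isn't a like-for-like stand-in for a single-queue setup.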
Thanks,
John
            | IOPS | FIO CPU util
------------------------------------------------
mq-deadline | 423K | usr=21.72%, sys=44.18%
------------------------------------------------
none        | 450K | usr=23.15%, sys=74.01%
------------------------------------------------
Today I re-ran the scsi_debug test on two server machines (32 cores, dual
NUMA nodes), and the CPU utilization issue can be reproduced. The test
results follow:
1) randread test on ibm-x3850x6[*] with deadline
             | IOPS | FIO CPU util
------------------------------------------------
hosttags     | 94K  | usr=1.13%, sys=14.75%
------------------------------------------------
non hosttags | 124K | usr=1.12%, sys=10.65%
------------------------------------------------
2) randread test on ibm-x3850x6[*] with none
             | IOPS | FIO CPU util
------------------------------------------------
hosttags     | 120K | usr=0.89%, sys=6.55%
------------------------------------------------
non hosttags | 121K | usr=1.07%, sys=7.35%
------------------------------------------------
*:
- that is the machine on which Yanhui reported the 20% increase in VM CPU
utilization
- kernel: latest linus tree (v5.12-rc7, commit: 7f75285ca57)
- the same test was also run on another 32-core machine; the IOPS drop isn't
observed there, but CPU utilization is clearly increased
3) test script
#!/bin/bash
run_fio() {
    RTIME=$1
    JOBS=$2
    DEVS=$3
    BS=$4
    QD=64
    BATCH=16

    fio --bs=$BS --ioengine=libaio \
        --iodepth=$QD \
        --iodepth_batch_submit=$BATCH \
        --iodepth_batch_complete_min=$BATCH \
        --filename=$DEVS \
        --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
        --name=test --group_reporting
}
SCHED=$1
NRQS=`getconf _NPROCESSORS_ONLN`
rmmod scsi_debug
modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler
echo 128 >/sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8k
rmmod scsi_debug
modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler
echo 128 >/sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8k
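A note on the device-discovery step in the script: it relies on lsscsi printing the block device node as the sixth whitespace-separated column. A minimal sketch against a canned lsscsi-style line (the sample line below is hypothetical, not captured output):

```shell
# Hypothetical lsscsi output line for a scsi_debug LUN; field 6 is the
# block device node, which the script passes to fio and to basename
# when writing the scheduler and queue_depth sysfs attributes.
line='[0:0:0:0]    disk    Linux    scsi_debug    0191    /dev/sdb'
DEV=$(echo "$line" | awk '{print $6}')
echo "$DEV"              # full node, e.g. /dev/sdb
echo "$(basename $DEV)"  # name used under /sys/block, e.g. sdb
```

If the test box has other disks whose vendor/model strings happen to contain "scsi_debug", the grep could match more than one line, so the script assumes exactly one scsi_debug device is present.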