Hi Mike,

On Wed, Mar 30, 2022 at 12:52:58PM -0400, Mike Snitzer wrote:
> Hey Tejun and Dennis,
>
> I recently found that due to bio_set_dev()'s call to
> bio_associate_blkg(), bio_set_dev() needs much more cpu than ideal;
> especially when doing 4K IOs via io_uring's HIPRI bio-polling.
>
> I'm very naive about blk-cgroups.. so I'm hopeful you or others can
> help me cut through this to understand what the ideal outcome should
> be for DM's bio clone + remap heavy use-case as it relates to
> bio_associate_blkg.
>
> If I hack dm-linear with a local __bio_set_dev that simply removes
> the call to bio_associate_blkg() my IOPS go from ~980K to 995K.
>
> Looking at what is happening a bit, relative to this DM bio cloning
> usecase, it seems __bio_clone() calls bio_clone_blkg_association() to
> clone the blkg from DM device, then dm-linear.c:linear_map's call
> to bio_set_dev() will cause bio_associate_blkg(bio) to reuse the css
> but then it triggers an update because the bdev is being remapped in
> the bio (due to linear_map sending the IO to the real underlying
> device). End result _seems_ like collective wasteful effort to get the
> blk-cgroup resources setup properly in the face of a simple remap.
>
> Seems the current DM pattern is causing repeat blkg work for _every_
> remapped bio? Do you see a way to speed up repeat calls to
> bio_associate_blkg()?
>

I must admit I wrote this with limited knowledge of bio cloning at the
time. I can fill in the thought process here.

The idea was every bio should have a blkg associated with it for io
accounting and things like blk-iolatency and blk-iocost. The device
abstraction I believe means we can set limits here as well on
submission rate to the md device.

I think cloning is a special case that I might have gotten wrong. If
there is a bio_set_dev() call after each clone(), then the
bio_clone_blkg_association() is excess work. We'd need to audit how
bio_alloc_clone() is being used to be safe.
Alternatively, we could opt for a bio_alloc_clone_noblkg(), but that's
a little bit uglier.

1. bio_set_dev() above md <- needed so we can do throttling on the md.
2. bio_alloc_clone()      <- doesn't need to clone the blkg info.
3. bio_set_dev() in md    <- sets the right underlying device association.

Thanks,
Dennis

> Test kernel is my latest dm-5.19 branch (though latest Linus 5.18-rc0
> kernel should be fine too):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
>
> I'm using dm-linear ontop on a 16G blk-mq null_blk device:
>
> modprobe null_blk queue_mode=2 poll_queues=2 bs=4096 gb=16
> SIZE=`blockdev --getsz /dev/nullb0`
> echo "0 $SIZE linear /dev/nullb0 0" | dmsetup create linear
>
> And running the workload with fio using this wrapper script:
> io_uring.sh 20 1 /dev/mapper/linear 4096
>
> #!/bin/bash
>
> RTIME=$1
> JOBS=$2
> DEV=$3
> BS=$4
>
> QD=64
> BATCH=16
> HI=1
>
> fio --bs=$BS --ioengine=io_uring --fixedbufs --registerfiles --hipri=$HI \
>     --iodepth=$QD \
>     --iodepth_batch_submit=$BATCH \
>     --iodepth_batch_complete_min=$BATCH \
>     --filename=$DEV \
>     --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
>     --name=test --group_reporting
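
For illustration, the three-step sequence from the reply (bio_set_dev() above
md, bio_alloc_clone(), bio_set_dev() in md) can be modelled in user space to
count how many blkg association operations one remapped bio triggers. This is
only a rough sketch, not kernel code: every model_* name is hypothetical, each
association is assumed to cost the same, and model_clone_noblkg() stands in for
the bio_alloc_clone_noblkg() idea, which does not exist in the kernel.

```c
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct model_bdev { const char *name; };

struct model_bio {
	const struct model_bdev *bdev;      /* device the bio targets */
	const struct model_bdev *blkg_bdev; /* device its "blkg" belongs to */
	int has_blkg;
};

/* Counts every blkg association operation performed by the model. */
static unsigned long blkg_work;

/* Models bio_associate_blkg(): bind the bio to the blkg of its current bdev. */
static void model_associate_blkg(struct model_bio *bio)
{
	bio->blkg_bdev = bio->bdev;
	bio->has_blkg = 1;
	blkg_work++;
}

/* Models bio_set_dev(): set the target device, then re-associate. */
static void model_set_dev(struct model_bio *bio, const struct model_bdev *bdev)
{
	bio->bdev = bdev;
	model_associate_blkg(bio);
}

/* Models bio_alloc_clone(): the clone also copies the blkg association. */
static struct model_bio model_clone(const struct model_bio *src)
{
	struct model_bio clone = *src;

	if (src->has_blkg)
		blkg_work++; /* models bio_clone_blkg_association() */
	return clone;
}

/* Models the hypothetical bio_alloc_clone_noblkg(): no association copied. */
static struct model_bio model_clone_noblkg(const struct model_bio *src)
{
	struct model_bio clone = { .bdev = src->bdev };

	return clone;
}

/* Returns the total blkg operations for one bio going through steps 1-3. */
static unsigned long remap_cost(int skip_clone_blkg)
{
	static const struct model_bdev dm = { "dm-linear" }, lower = { "nullb0" };
	struct model_bio orig = { .bdev = NULL };
	struct model_bio clone;

	blkg_work = 0;
	model_set_dev(&orig, &dm);            /* 1. bio_set_dev() above md */
	clone = skip_clone_blkg ? model_clone_noblkg(&orig)
				: model_clone(&orig); /* 2. clone */
	model_set_dev(&clone, &lower);        /* 3. bio_set_dev() in md */
	return blkg_work;
}
```

In this model remap_cost(0) counts three operations and remap_cost(1) counts
two. The difference is the clone-time association that step 3 immediately
overwrites. The model says nothing about how expensive that operation is in
the kernel.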