On Fri, Apr 12, 2019 at 7:01 AM Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
>
> On 04/11/2019 05:37 PM, Ilya Dryomov wrote:
> > On Thu, Apr 11, 2019 at 5:46 AM Dongsheng Yang
> > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> >> To improve sequential IO of rbd devices, we want as many chances
> >> to merge IO as possible.
> >>
> >> But when blk_mq_make_request() tries to merge with IOs in the
> >> software queue via blk_mq_sched_bio_merge(), it finds the software
> >> queue empty in most cases.
> >>
> >> The reason is that when blk_mq calls blk_mq_run_hw_queue() to
> >> dispatch requests from the software queue to the driver, it calls
> >> .queue_rq() implemented in the rbd driver.  We only queue a work in
> >> rbd_mq_ops->queue_rq(), with no possibility of returning
> >> BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE, so the requests in the
> >> software queue are dequeued anyway.  That means in most cases the
> >> software queue is empty when the next request wants to do an IO
> >> merge.
> >>
> >> To improve this, blk_mq provides .get_budget and .put_budget in
> >> blk_mq_ops.  A request then needs to get a budget before being
> >> dequeued from the software queue; if .get_budget returns false, the
> >> request remains in the software queue and waits for the next
> >> dispatch.
> >>
> >> So this commit introduces a busy_threshold option in the rbd driver
> >> and implements .get_budget and .put_budget in rbd_mq_ops.  We
> >> increase budget_reserved in .get_budget and decrease it in
> >> .put_budget.  When budget_reserved is larger than busy_threshold,
> >> .get_budget returns false to indicate that the rbd device is busy.
> >>
> >> Signed-off-by: Dongsheng Yang <dongsheng.yang@xxxxxxxxxxxx>
> >> ---
> >> In my testing, we get about 60% improvement in sequential reads and
> >> a similar level in random reads, with -o busy_threshold=64.
> >> (bs=4K iodepth=128 numjobs=1)
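
A rough sketch of what hooks along these lines might look like in
drivers/block/rbd.c (not the actual patch, just an illustration).  It
assumes the hctx-based bool .get_budget / void .put_budget signatures
of this kernel, plus hypothetical atomic_t budget_reserved and
busy_threshold fields added to struct rbd_device, as described above:

static bool rbd_mq_get_budget(struct blk_mq_hw_ctx *hctx)
{
	struct rbd_device *rbd_dev = hctx->queue->queuedata;

	/*
	 * Over the threshold: report "no budget" so the request stays
	 * in the software queue, giving later bios something to merge
	 * into.  (The busy_threshold == 0 "feature disabled" case is
	 * not handled in this sketch.)
	 */
	if (atomic_inc_return(&rbd_dev->budget_reserved) >
	    rbd_dev->busy_threshold) {
		atomic_dec(&rbd_dev->budget_reserved);
		return false;
	}
	return true;
}

static void rbd_mq_put_budget(struct blk_mq_hw_ctx *hctx)
{
	struct rbd_device *rbd_dev = hctx->queue->queuedata;

	atomic_dec(&rbd_dev->budget_reserved);
}

static const struct blk_mq_ops rbd_mq_ops = {
	.queue_rq	= rbd_queue_rq,
	.init_request	= rbd_init_request,
	.get_budget	= rbd_mq_get_budget,
	.put_budget	= rbd_mq_put_budget,
};

Note that blk-mq only calls .put_budget when a request it got budget
for ends up not being handed to .queue_rq(), so a real implementation
would also have to drop the reservation when a dispatched request
completes; that part is left out of the sketch above.
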
> > Hi Dongsheng,
> >
> > Have you tried running the same test with -o queue_depth=64?  I thought
> > get_budget was supposed to be a SCSI-specific workaround for a very big
> > tagset shared between many individual devices capable of handling only
> > a couple of requests at the same time.  For rbd, -o queue_depth controls
> > the size of the tagset for each individual device, so an extra layer of
> > budgeting shouldn't be needed?
> Hi Ilya,
>     The reason I introduced busy_threshold is that I found nr_requests
> is the same as queue_depth in my case.  So even if we use
> queue_depth=64, there is no chance to merge, because newer requests
> can't be submitted into blk_mq.
>
> But I think I was wrong; that's only true with the none scheduler,
> which is what my use case uses.  Any other scheduler will overwrite
> nr_requests to double the queue_depth:
>
>     q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
>                                BLKDEV_MAX_RQ);
>
> So when we use queue_depth=64, nr_requests would be 128.  Then when we
> test with iodepth=128 numjobs=1, there are 64 IOs in flight in the rbd
> driver and the later 64 get a chance to be merged.
>
> Test result:
> (1) rbd map IMAGE -o queue_depth=64 (mq-deadline, fio bs=4K iodepth=128
>     numjobs=1)
>      iops=17k, mergedIO (85555)
> (2) rbd map IMAGE -o busy_threshold=64 (mq-deadline, fio bs=4K
>     iodepth=128 numjobs=1)
>      iops=16.9K, mergedIO (103374)
>
> The test result matches what we analyzed: IO gets merged with
> queue_depth=64, but the number of merged IOs is smaller than with
> busy_threshold=64.  That's reasonable, as nr_requests in case 2 is 256,
> so we have (256 - 64) = 192 IOs pending that can get a chance to be
> merged.  But in case 1, nr_requests is 128, so there are only 64 IOs
> that can get a chance to merge.  On the other hand, the busy_threshold
> case introduces .get_budget and .put_budget, which add some overhead;
> that is why its iops is slightly lower than in the queue_depth=64 case.
> So queue_depth=64 is better than busy_threshold=64 with the mq-deadline
> scheduler.
>
> The only scenario where busy_threshold could be considered usable is
> the none scheduler:
> a) rbd map IMAGE -o queue_depth=64 (none, fio bs=4K iodepth=128
>    numjobs=1)
>     iops=9464, mergedIO(0)
> b) rbd map IMAGE -o busy_threshold=64 (none, fio bs=4K iodepth=128
>    numjobs=1)
>     iops=18314, mergedIO(100099)
>
> Hmm, no merging with the none scheduler doesn't sound like a problem,
> so it looks like it's really not worth introducing busy_threshold
> upstream at all.

Right, get_budget is really just a workaround for SCSI.

Note that scheduler = none doesn't mean no merges.  With explicit
plugging merges will still happen even without a scheduler.  By default
fio will submit/complete one I/O at a time, which runs foul of per-task
plugging.  If you tell it to submit/complete I/O in big enough batches,
you should see object size I/Os:

$ sudo rbd map foo
$ echo none | sudo tee /sys/block/rbd0/queue/scheduler
none
$ fio --name=seq4k --filename=/dev/rbd0 --ioengine=libaio --iodepth=1024 --rw=read --bs=4k --direct=1 --numjobs=1
...
Run status group 0 (all jobs):
   READ: bw=25.5MiB/s (26.8MB/s), 25.5MiB/s-25.5MiB/s (26.8MB/s-26.8MB/s), io=1024MiB (1074MB), run=40137-40137msec

Disk stats (read/write):
  rbd0: ios=262118/0, merge=0/0, ticks=5023012/0, in_queue=4892398, util=44.23%

$ fio --name=seq4k --filename=/dev/rbd0 --ioengine=libaio --iodepth=1024 --iodepth_batch_submit=1024 --iodepth_batch_complete=1024 --rw=read --bs=4k --direct=1 --numjobs=1
...
Run status group 0 (all jobs):
   READ: bw=83.1MiB/s (87.2MB/s), 83.1MiB/s-83.1MiB/s (87.2MB/s-87.2MB/s), io=1024MiB (1074MB), run=12317-12317msec

Disk stats (read/write):
  rbd0: ios=252/0, merge=257796/0, ticks=11002/0, in_queue=10873, util=18.48%

No merges in the first run, and literally everything is merged (into
4M I/Os) in the second run.

Thanks,

                Ilya