On Fri, Apr 12, 2019 at 7:01 AM Dongsheng Yang
<dongsheng.yang@xxxxxxxxxxxx> wrote:
>
> On 04/11/2019 05:37 PM, Ilya Dryomov wrote:
> > On Thu, Apr 11, 2019 at 5:46 AM Dongsheng Yang
> > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> >> To improve sequential IO of rbd devices, we want as many chances
> >> to merge IO as possible.
> >>
> >> But when blk_mq_make_request() tries to merge with IOs in the
> >> software queue via blk_mq_sched_bio_merge(), it finds the software
> >> queue empty in most cases.
> >>
> >> The reason is that when blk_mq calls blk_mq_run_hw_queue() to
> >> dispatch requests from the software queue to the driver, it calls
> >> .queue_rq() implemented in the rbd driver.  We only queue a work in
> >> rbd_mq_ops->queue_rq(), with no possibility of returning
> >> BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE, so the requests in the
> >> software queue are dequeued anyway.  That means in most cases the
> >> software queue is empty when the next request wants to do an IO
> >> merge.
> >>
> >> To improve this, blk_mq provides .get_budget and .put_budget in
> >> blk_mq_ops.  A request then needs to get a budget before being
> >> dequeued from the software queue; if .get_budget returns false, the
> >> request remains in the software queue and waits for the next
> >> dispatch.
> >>
> >> So this commit introduces a busy_threshold option in the rbd driver
> >> and implements .get_budget and .put_budget in rbd_mq_ops.  We
> >> increase budget_reserved in .get_budget and decrease it in
> >> .put_budget.  When budget_reserved is larger than busy_threshold,
> >> .get_budget returns false to indicate that the rbd device is busy.
> >>
> >> Signed-off-by: Dongsheng Yang <dongsheng.yang@xxxxxxxxxxxx>
> >> ---
> >> In my testing, we get about 60% improvement in sequential reads and
> >> a similar level in random reads, with -o busy_threshold=64.
> >> (bs=4K iodepth=128 numjobs=1)
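
A rough sketch of what hooks along these lines might look like in
drivers/block/rbd.c (not the actual patch, just an illustration).  It
assumes the hctx-based bool .get_budget / void .put_budget signatures
of this kernel, plus hypothetical atomic_t budget_reserved and
busy_threshold fields added to struct rbd_device, as described above:

static bool rbd_mq_get_budget(struct blk_mq_hw_ctx *hctx)
{
	struct rbd_device *rbd_dev = hctx->queue->queuedata;

	/*
	 * Over the threshold: report "no budget" so the request stays
	 * in the software queue, giving later bios something to merge
	 * into.  (The busy_threshold == 0 "feature disabled" case is
	 * not handled in this sketch.)
	 */
	if (atomic_inc_return(&rbd_dev->budget_reserved) >
	    rbd_dev->busy_threshold) {
		atomic_dec(&rbd_dev->budget_reserved);
		return false;
	}
	return true;
}

static void rbd_mq_put_budget(struct blk_mq_hw_ctx *hctx)
{
	struct rbd_device *rbd_dev = hctx->queue->queuedata;

	atomic_dec(&rbd_dev->budget_reserved);
}

static const struct blk_mq_ops rbd_mq_ops = {
	.queue_rq	= rbd_queue_rq,
	.init_request	= rbd_init_request,
	.get_budget	= rbd_mq_get_budget,
	.put_budget	= rbd_mq_put_budget,
};

Note that blk-mq only calls .put_budget when a request it got budget
for ends up not being handed to .queue_rq(), so a real implementation
would also have to drop the reservation when a dispatched request
completes; that part is left out of the sketch above.
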
> > Hi Dongsheng,
> >
> > Have you tried running the same test with -o queue_depth=64?  I thought
> > get_budget was supposed to be a SCSI-specific workaround for a very big
> > tagset shared between many individual devices capable of handling only
> > a couple of requests at the same time.  For rbd, -o queue_depth controls
> > the size of the tagset for each individual device, so an extra layer of
> > budgeting shouldn't be needed?
> Hi Ilya,
>     The reason I introduced busy_threshold is that I found nr_requests
> is the same as queue_depth in my case.  So even if we use
> queue_depth=64, there is no chance to merge, because newer requests
> can't be submitted into blk_mq.
>
> But I think I was wrong; that's only true with the none scheduler,
> which is what my use case uses.  Any other scheduler will overwrite
> nr_requests to double the queue_depth:
>
>     q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
>                                BLKDEV_MAX_RQ);
>
> So when we use queue_depth=64, nr_requests would be 128.  Then when we
> test with iodepth=128 numjobs=1, there are 64 IOs in flight in the rbd
> driver and the later 64 get a chance to be merged.
>
> Test result:
> (1) rbd map IMAGE -o queue_depth=64 (mq-deadline, fio bs=4K iodepth=128
>     numjobs=1)
>      iops=17k, mergedIO (85555)
> (2) rbd map IMAGE -o busy_threshold=64 (mq-deadline, fio bs=4K
>     iodepth=128 numjobs=1)
>      iops=16.9K, mergedIO (103374)
>
> The test result matches what we analyzed: IO gets merged with
> queue_depth=64, but the number of merged IOs is smaller than with
> busy_threshold=64.  That's reasonable, as nr_requests in case 2 is 256,
> so we have (256 - 64) = 192 IOs pending that can get a chance to be
> merged.  But in case 1, nr_requests is 128, so there are only 64 IOs
> that can get a chance to merge.  On the other hand, the busy_threshold
> case introduces .get_budget and .put_budget, which add some overhead;
> that is why its iops is slightly lower than in the queue_depth=64 case.
> So queue_depth=64 is better than busy_threshold=64 with the mq-deadline
> scheduler.
>
> The only scenario where busy_threshold could be considered usable is
> the none scheduler:
> a) rbd map IMAGE -o queue_depth=64 (none, fio bs=4K iodepth=128
>    numjobs=1)
>     iops=9464, mergedIO(0)
> b) rbd map IMAGE -o busy_threshold=64 (none, fio bs=4K iodepth=128
>    numjobs=1)
>     iops=18314, mergedIO(100099)
>
> Hmm, no merging with the none scheduler doesn't sound like a problem,
> so it looks like it's really not worth introducing busy_threshold
> upstream at all.

Right, get_budget is really just a workaround for SCSI.

Note that scheduler = none doesn't mean no merges.  With explicit
plugging merges will still happen even without a scheduler.  By default
fio will submit/complete one I/O at a time, which runs foul of per-task
plugging.  If you tell it to submit/complete I/O in big enough batches,
you should see object size I/Os:

$ sudo rbd map foo
$ echo none | sudo tee /sys/block/rbd0/queue/scheduler
none
$ fio --name=seq4k --filename=/dev/rbd0 --ioengine=libaio --iodepth=1024 --rw=read --bs=4k --direct=1 --numjobs=1
...
Run status group 0 (all jobs):
   READ: bw=25.5MiB/s (26.8MB/s), 25.5MiB/s-25.5MiB/s (26.8MB/s-26.8MB/s), io=1024MiB (1074MB), run=40137-40137msec

Disk stats (read/write):
  rbd0: ios=262118/0, merge=0/0, ticks=5023012/0, in_queue=4892398, util=44.23%

$ fio --name=seq4k --filename=/dev/rbd0 --ioengine=libaio --iodepth=1024 --iodepth_batch_submit=1024 --iodepth_batch_complete=1024 --rw=read --bs=4k --direct=1 --numjobs=1
...
Run status group 0 (all jobs):
   READ: bw=83.1MiB/s (87.2MB/s), 83.1MiB/s-83.1MiB/s (87.2MB/s-87.2MB/s), io=1024MiB (1074MB), run=12317-12317msec

Disk stats (read/write):
  rbd0: ios=252/0, merge=257796/0, ticks=11002/0, in_queue=10873, util=18.48%

No merges in the first run, and literally everything is merged (into
4M I/Os) in the second run.

Thanks,

                Ilya