On Mon, Nov 15, 2021 at 3:58 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> On Mon, Nov 15, 2021 at 02:01:32PM +0100, Haris Iqbal wrote:
> > On Wed, Nov 10, 2021 at 2:39 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > >
> > > Hello Haris,
> > >
> > > On Tue, Nov 09, 2021 at 10:32:32AM +0100, Haris Iqbal wrote:
> > > > Hi,
> > > >
> > > > We are observing an fio hang with the latest v5.10.78 Linux kernel
> > > > version with RNBD. The setup is as follows:
> > > >
> > > > On the server side, 16 nullblk devices.
> > > > On the client side, map those 16 block devices through RNBD-RTRS.
> > > > Change the scheduler for those RNBD block devices to mq-deadline.
> > > >
> > > > Run fio with the following configuration.
> > > >
> > > > [global]
> > > > description=Emulation of Storage Server Access Pattern
> > > > bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
> > > > fadvise_hint=0
> > > > rw=randrw:2
> > > > direct=1
> > > > random_distribution=zipf:1.2
> > > > time_based=1
> > > > runtime=60
> > > > ramp_time=1
> > > > ioengine=libaio
> > > > iodepth=128
> > > > iodepth_batch_submit=128
> > > > iodepth_batch_complete=128
> > > > numjobs=1
> > > > group_reporting
> > > >
> > > > [job1]
> > > > filename=/dev/rnbd0
> > > > [job2]
> > > > filename=/dev/rnbd1
> > > > [job3]
> > > > filename=/dev/rnbd2
> > > > [job4]
> > > > filename=/dev/rnbd3
> > > > [job5]
> > > > filename=/dev/rnbd4
> > > > [job6]
> > > > filename=/dev/rnbd5
> > > > [job7]
> > > > filename=/dev/rnbd6
> > > > [job8]
> > > > filename=/dev/rnbd7
> > > > [job9]
> > > > filename=/dev/rnbd8
> > > > [job10]
> > > > filename=/dev/rnbd9
> > > > [job11]
> > > > filename=/dev/rnbd10
> > > > [job12]
> > > > filename=/dev/rnbd11
> > > > [job13]
> > > > filename=/dev/rnbd12
> > > > [job14]
> > > > filename=/dev/rnbd13
> > > > [job15]
> > > > filename=/dev/rnbd14
> > > > [job16]
> > > > filename=/dev/rnbd15
> > > >
> > > > Some of the fio threads hang and fio never finishes.
> > > >
> > > > fio fio.ini
> > > > job1: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job2: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job3: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job4: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job5: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job6: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job7: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job8: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job9: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job10: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job11: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job12: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job13: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job14: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job15: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > job16: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T) 512B-64.0KiB, ioengine=libaio, iodepth=128
> > > > fio-3.12
> > > > Starting 16 processes
> > > > Jobs: 16 (f=12): [m(3),/(2),m(5),/(1),m(1),/(1),m(3)][0.0%][r=130MiB/s,w=130MiB/s][r=14.7k,w=14.7k IOPS][eta 04d:07h:4
> > > > Jobs: 15 (f=11): [m(3),/(2),m(5),/(1),_(1),/(1),m(3)][51.2%][r=7395KiB/s,w=6481KiB/s][r=770,w=766 IOPS][eta 01m:01s]
> > > > Jobs: 15 (f=11): [m(3),/(2),m(5),/(1),_(1),/(1),m(3)][52.7%][eta 01m:01s]
> > > >
> > > > We checked the block devices, and there are requests waiting in their
> > > > fifo (not on all devices, just a few whose corresponding fio threads
> > > > are hung).
> > > >
> > > > $ cat /sys/kernel/debug/block/rnbd0/sched/read_fifo_list
> > > > 00000000ce398aec {.op=READ, .cmd_flags=, .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=209}
> > > > 000000005ec82450 {.op=READ, .cmd_flags=, .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=210}
> > > >
> > > > $ cat /sys/kernel/debug/block/rnbd0/sched/write_fifo_list
> > > > 000000000c1557f5 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=195}
> > > > 00000000fc6bfd98 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=199}
> > > > 000000009ef7c802 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=217}
> > >
> > > Can you post the whole debugfs log for rnbd0?
> > >
> > > (cd /sys/kernel/debug/block/rnbd0 && find . -type f -exec grep -aH . {} \;)
> >
> > Attached the logfile.
>
> The logfile just shows that there are requests in the scheduler queue.
>
> > > >
> > > > Potential points which fix the hang:
> > > >
> > > > 1) Using no scheduler (none) on the client side RNBD block devices
> > > > results in no hang.
> > > >
> > > > 2) In the fio config, changing the line "iodepth_batch_complete=128"
> > > > to the following fixes the hang,
> > > > iodepth_batch_complete_min=1
> > > > iodepth_batch_complete_max=128
> > > > OR,
> > > > iodepth_batch_complete=0
> > > >
> > > > 3) We also tracked down the version from which the hang started. The
> > > > hang started with v5.10.50, and the following commit was the one which
> > > > results in the hang:
> > > >
> > > > commit 512106ae2355813a5eb84e8dc908628d52856890
> > > > Author: Ming Lei <ming.lei@xxxxxxxxxx>
> > > > Date:   Fri Jun 25 10:02:48 2021 +0800
> > > >
> > > >     blk-mq: update hctx->dispatch_busy in case of real scheduler
> > > >
> > > >     [ Upstream commit cb9516be7708a2a18ec0a19fe3a225b5b3bc92c7 ]
> > > >
> > > >     Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
> > > >     starts to support io batching submission by using hctx->dispatch_busy.
> > > >
> > > >     However, blk_mq_update_dispatch_busy() isn't changed to update
> > > >     hctx->dispatch_busy in that commit, so fix the issue by updating
> > > >     hctx->dispatch_busy in case of real scheduler.
> > > >
> > > >     Reported-by: Jan Kara <jack@xxxxxxx>
> > > >     Reviewed-by: Jan Kara <jack@xxxxxxx>
> > > >     Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
> > > >     Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > >     Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@xxxxxxxxxx
> > > >     Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> > > >     Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 00d6ed2fe812..a368eb6dc647 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1242,9 +1242,6 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
> > > >  {
> > > >          unsigned int ewma;
> > > >
> > > > -        if (hctx->queue->elevator)
> > > > -                return;
> > > > -
> > > >          ewma = hctx->dispatch_busy;
> > > >
> > > >          if (!ewma && !busy)
> > > >
> > > > We reverted the commit and tested and there is no hang.
> > > >
> > > > 4) Lastly, we tested newer versions like 5.13, and there is NO hang
> > > > there either. Hence, probably some other change fixed it.
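[For context on the commit quoted in point 3: once the hunk above is applied,
blk_mq_update_dispatch_busy() no longer returns early when a scheduler such as
mq-deadline is attached, so its EWMA can actually become non-zero for these
queues. The sketch below is reconstructed from the diff context plus the
mainline function; the two EWMA constants are quoted from memory rather than
from the 5.10.y tree, so treat them as approximate.]

#define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT  8   /* value as remembered from mainline */
#define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR  4   /* value as remembered from mainline */

static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
{
        unsigned int ewma;

        ewma = hctx->dispatch_busy;

        if (!ewma && !busy)
                return;

        /* Exponentially weighted moving average of "driver was busy". */
        ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
        if (busy)
                ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
        ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

        hctx->dispatch_busy = ewma;
}

A non-zero hctx->dispatch_busy only changes how many requests are pulled from
the scheduler per dispatch round, as Ming explains further down the thread.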
> > >
> > > Can you observe the issue on v5.10? Maybe there is a pre-patch of commit
> > > cb9516be7708 ("blk-mq: update hctx->dispatch_busy in case of real
> > > scheduler") which is missing from 5.10.y.
> >
> > If you mean v5.10.0, then no, I see no hang there. As I mentioned
> > before, there is no hang till v5.10.49.
> >
> > > And I don't remember any fix for commit cb9516be7708 in mainline.
> > >
> > > Commit cb9516be7708 was merged in v5.14, not v5.13; did you test v5.14
> > > or v5.15?
> > >
> > > BTW, commit cb9516be7708 should just affect performance; it is not
> > > supposed to cause a hang.
> >
> > True. It does look like that from the small code change.
>
> ->dispatch_busy just affects the batching size for dequeuing requests
> from the scheduler queue, see the following code in
> __blk_mq_do_dispatch_sched():
>
>         if (hctx->dispatch_busy)
>                 max_dispatch = 1;
>         else
>                 max_dispatch = hctx->queue->nr_requests;
>
> Also see blk_mq_do_dispatch_sched():
>
> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
>         int ret;
>
>         do {
>                 ret = __blk_mq_do_dispatch_sched(hctx);
>         } while (ret == 1);
>
>         return ret;
> }
>
> If any request can be dispatched to the driver, blk-mq will continue to
> dispatch until -EAGAIN or 0 is returned from __blk_mq_do_dispatch_sched().
>
> In case of -EAGAIN, blk-mq will try again to dispatch requests, and will
> run the queue asynchronously if another -EAGAIN is returned.
>
> In case 0 is returned because blk_mq_dispatch_rq_list() can't make
> progress, the request will be moved to hctx->dispatch, and blk-mq keeps
> re-running the queue until all requests in hctx->dispatch are dispatched,
> then goes back to dispatching requests from the scheduler queue.
>
> So the current code already provides forward progress, and I don't see how
> the patch 'blk-mq: update hctx->dispatch_busy in case of real scheduler'
> can cause this issue.
>
> Also, I don't see any request in hctx->dispatch in the debugfs log.
> >
> > I wasn't able to test in v5.14 and v5.15 because we are seeing some
> > other errors in those versions, most probably related to the
> > rdma-core/rxe driver.
>
> I'd rather see a test result on an upstream kernel: v5.14, v5.15, or a
> recent v5.15-rc.

We were finally able to test in 5.15.2, and there is no hang.

>
> Thanks,
> Ming
>
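[To make the forward-progress argument above concrete, here is a small,
self-contained toy model of the dispatch loop Ming quotes. It is plain
user-space C, not kernel code; all names and numbers are invented for
illustration.]

#include <stdio.h>

/*
 * Toy stand-ins: "sched_queue" plays the role of requests held by
 * mq-deadline, "driver_slots" the role of whatever limits
 * blk_mq_dispatch_rq_list() (driver tags, budget, ...), and
 * "hctx_dispatch" the role of requests parked on hctx->dispatch.
 */
static int sched_queue = 4;
static int driver_slots = 2;
static int hctx_dispatch;

/* Roughly __blk_mq_do_dispatch_sched(): 1 = made progress, 0 = stop. */
static int dispatch_once(void)
{
        if (!sched_queue)
                return 0;               /* nothing left to dispatch */
        sched_queue--;
        if (!driver_slots) {
                hctx_dispatch++;        /* could not go out: parked */
                return 0;               /* queue will be re-run later */
        }
        driver_slots--;
        return 1;
}

int main(void)
{
        int ret;

        /* Like blk_mq_do_dispatch_sched(): loop while progress is made. */
        do {
                ret = dispatch_once();
        } while (ret == 1);

        printf("left in scheduler: %d, parked on hctx->dispatch: %d\n",
               sched_queue, hctx_dispatch);
        return 0;
}

In the real code, requests left on hctx->dispatch force the queue to be re-run
until they drain, and an -EAGAIN return triggers an asynchronous re-run, which
is why a smaller batching size (dispatch_busy != 0) is expected to slow
dispatch down rather than stop it.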