Re: Observing an fio hang with RNBD device in the latest v5.10.78 Linux kernel

On Wed, Nov 10, 2021 at 2:39 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> Hello Haris,
>
> On Tue, Nov 09, 2021 at 10:32:32AM +0100, Haris Iqbal wrote:
> > Hi,
> >
> > We are observing an fio hang with RNBD on the latest v5.10.78 Linux
> > kernel. The setup is as follows:
> >
> > On the server side, create 16 nullblk devices.
> > On the client side, map those 16 block devices through RNBD-RTRS.
> > Change the scheduler for those RNBD block devices to mq-deadline
> > (a rough sketch of these steps follows below).
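> >
> > A minimal sketch of the setup commands, assuming a placeholder
> > session name and server address (the exact rnbd-client sysfs
> > attribute names may differ by kernel version):
> >
> > # server side: create 16 null block devices
> > modprobe null_blk nr_devices=16
> >
> > # client side: map one device over RTRS (repeat for nullb1..nullb15)
> > echo "sessname=testsess path=ip:<server-ip> device_path=/dev/nullb0" > \
> >     /sys/devices/virtual/rnbd-client/ctl/map_device
> >
> > # switch the mapped device to mq-deadline
> > echo mq-deadline > /sys/block/rnbd0/queue/scheduler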
> >
> > Run fio with the following configuration.
> >
> > [global]
> > description=Emulation of Storage Server Access Pattern
> > bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
> > fadvise_hint=0
> > rw=randrw:2
> > direct=1
> > random_distribution=zipf:1.2
> > time_based=1
> > runtime=60
> > ramp_time=1
> > ioengine=libaio
> > iodepth=128
> > iodepth_batch_submit=128
> > iodepth_batch_complete=128
> > numjobs=1
> > group_reporting
> >
> >
> > [job1]
> > filename=/dev/rnbd0
> > [job2]
> > filename=/dev/rnbd1
> > [job3]
> > filename=/dev/rnbd2
> > [job4]
> > filename=/dev/rnbd3
> > [job5]
> > filename=/dev/rnbd4
> > [job6]
> > filename=/dev/rnbd5
> > [job7]
> > filename=/dev/rnbd6
> > [job8]
> > filename=/dev/rnbd7
> > [job9]
> > filename=/dev/rnbd8
> > [job10]
> > filename=/dev/rnbd9
> > [job11]
> > filename=/dev/rnbd10
> > [job12]
> > filename=/dev/rnbd11
> > [job13]
> > filename=/dev/rnbd12
> > [job14]
> > filename=/dev/rnbd13
> > [job15]
> > filename=/dev/rnbd14
> > [job16]
> > filename=/dev/rnbd15
> >
> > Some of the fio threads hang and fio never finishes.
> >
> > fio fio.ini
> > job1: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job2: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job3: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job4: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job5: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job6: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job7: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job8: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job9: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job10: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job11: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job12: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job13: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job14: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job15: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > job16: (g=0): rw=randrw, bs=(R) 512B-64.0KiB, (W) 512B-64.0KiB, (T)
> > 512B-64.0KiB, ioengine=libaio, iodepth=128
> > fio-3.12
> > Starting 16 processes
> > Jobs: 16 (f=12): [m(3),/(2),m(5),/(1),m(1),/(1),m(3)][0.0%][r=130MiB/s,w=130MiB/s][r=14.7k,w=14.7k IOPS][eta 04d:07h:4
> > Jobs: 15 (f=11): [m(3),/(2),m(5),/(1),_(1),/(1),m(3)][51.2%][r=7395KiB/s,w=6481KiB/s][r=770,w=766 IOPS][eta 01m:01s]
> > Jobs: 15 (f=11): [m(3),/(2),m(5),/(1),_(1),/(1),m(3)][52.7%][eta 01m:01s]
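> >
> > While the hang is in progress, the stuck fio threads can be located
> > by dumping all blocked tasks to the kernel log (standard sysrq; just
> > a diagnostic suggestion, assuming sysrq is enabled):
> >
> > echo w > /proc/sysrq-trigger
> > dmesg | tail -n 100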
> >
> > We checked the block devices, and there are requests waiting in their
> > fifos (not on all devices, just a few whose corresponding fio threads
> > are hung).
> >
> > $ cat /sys/kernel/debug/block/rnbd0/sched/read_fifo_list
> > 00000000ce398aec {.op=READ, .cmd_flags=,
> > .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1,
> > .internal_tag=209}
> > 000000005ec82450 {.op=READ, .cmd_flags=,
> > .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1,
> > .internal_tag=210}
> >
> > $ cat /sys/kernel/debug/block/rnbd0/sched/write_fifo_list
> > 000000000c1557f5 {.op=WRITE, .cmd_flags=SYNC|IDLE,
> > .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1,
> > .internal_tag=195}
> > 00000000fc6bfd98 {.op=WRITE, .cmd_flags=SYNC|IDLE,
> > .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1,
> > .internal_tag=199}
> > 000000009ef7c802 {.op=WRITE, .cmd_flags=SYNC|IDLE,
> > .rq_flags=SORTED|ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1,
> > .internal_tag=217}
>
> Can you post the whole debugfs log for rnbd0?
>
> (cd /sys/kernel/debug/block/rnbd0 && find . -type f -exec grep -aH . {} \;)

Attached is the logfile.

>
> >
> >
> > Potential points which fix the hang:
> >
> > 1) Using no scheduler (none) on the client-side RNBD block devices
> > results in no hang.
> >
> > 2) In the fio config, changing the line "iodepth_batch_complete=128"
> > to the following fixes the hang:
> > iodepth_batch_complete_min=1
> > iodepth_batch_complete_max=128
> > OR,
> > iodepth_batch_complete=0
> >
> > 3) We also tracked down the version in which the hang started. The
> > hang started with v5.10.50, and the following commit is the one that
> > introduces the hang:
> >
> > commit 512106ae2355813a5eb84e8dc908628d52856890
> > Author: Ming Lei <ming.lei@xxxxxxxxxx>
> > Date:   Fri Jun 25 10:02:48 2021 +0800
> >
> >     blk-mq: update hctx->dispatch_busy in case of real scheduler
> >
> >     [ Upstream commit cb9516be7708a2a18ec0a19fe3a225b5b3bc92c7 ]
> >
> >     Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
> >     starts to support io batching submission by using hctx->dispatch_busy.
> >
> >     However, blk_mq_update_dispatch_busy() isn't changed to update
> >     hctx->dispatch_busy in that commit, so fix the issue by updating
> >     hctx->dispatch_busy in case of real scheduler.
> >
> >     Reported-by: Jan Kara <jack@xxxxxxx>
> >     Reviewed-by: Jan Kara <jack@xxxxxxx>
> >     Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
> >     Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >     Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@xxxxxxxxxx
> >     Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >     Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 00d6ed2fe812..a368eb6dc647 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1242,9 +1242,6 @@ static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
> >  {
> >         unsigned int ewma;
> >
> > -       if (hctx->queue->elevator)
> > -               return;
> > -
> >         ewma = hctx->dispatch_busy;
> >
> >         if (!ewma && !busy)
> >
> > We reverted the commit and tested, and there is no hang.
> >
> > 4) Lastly, we tested a newer version, v5.13, and there is NO hang
> > there either. Hence, some other change probably fixed it.
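> >
> > Since the commit changes how hctx->dispatch_busy is updated when a
> > real scheduler is attached, its value on a hung device may be worth
> > checking. A quick sketch, assuming debugfs is mounted and the kernel
> > exposes the per-hctx dispatch_busy attribute:
> >
> > for h in /sys/kernel/debug/block/rnbd0/hctx*; do
> >     grep -aH . "$h/dispatch_busy"
> > done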
>
> Can you observe the issue on v5.10? Maybe there is a prerequisite patch
> of commit cb9516be7708 ("blk-mq: update hctx->dispatch_busy in case of
> real scheduler") that was missed in 5.10.y.

If you mean v5.10.0, then no, I see no hang there. As I mentioned
before, there is no hang up to v5.10.49.

>
> And I do not remember there being a fix for commit cb9516be7708 in mainline.
>
> Commit cb9516be7708 was merged in v5.14, not v5.13. Did you test v5.14
> or v5.15?
>
> BTW, commit cb9516be7708 should only affect performance; it is not
> supposed to cause a hang.

True. It does look that way from the small code change.

I wasn't able to test on v5.14 and v5.15 because we are seeing some
other errors in those versions, most probably related to the
rdma-core/rxe driver.

>
> Thanks,
> Ming
>

PS: I am sending this again since I keep getting a notification that
the delivery of my last response failed.

Attachment: debugfslog_rnbd0
Description: Binary data

