Re: [External] Re: raid5 deadlock issue

Zhang Tianci <zhangtianci.1997@xxxxxxxxxxxxx> · Fri, 11 Nov 2022 14:26:54 +0800

On Fri, Nov 11, 2022 at 12:14 PM Song Liu <song@xxxxxxxxxx> wrote:
>
> On Thu, Nov 10, 2022 at 10:25 AM Zhang Tianci
> <zhangtianci.1997@xxxxxxxxxxxxx> wrote:
> >
>
> > > fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100
> > > -iodepth=1 -runtime=60
> > > -rw=write -group_reporting -name="test"
> > >
> > > Then I found the first deadlock state, but it is not the real reason.
> > >
> > > I will do a test with the latest kernel. I will report to you the result later.
> > >
> > I can reproduce the first deadlock in linux-6.1-rc4.
> > There are 26 stripe_head and 26 fio threads blocked with same backtrace:
> >
> >  #0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
> >  #1 [ffffc9000cd0f940] schedule at ffffffff818b4313
> >  #2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
> >  #3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
> >  #4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
> >  #5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
> >  #6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
> >  #7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
> >  #8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
> >  #9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
> > #10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
> > #11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
> > #12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
> > #13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
> > #14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
> > #15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
> > #16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
> > #17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
> > #18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
> > #19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
> > #20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b
>
> Thanks for this information.
>
> I guess this is with COUNTER_MAX of 4? And it is slightly different to the
> issue you found?

Yes, I hacked COUNTER_MAX down to 4. I think this increases the
probability of a race on the bitmap counter.

Without that hack, this kind of deadlock is very hard to trigger.

It only happened while I was debugging, but it helped me form a guess
about the real cause (the second deadlock state in my first email).
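
To illustrate why a small COUNTER_MAX makes the wait in
md_bitmap_startwrite easier to hit, here is a toy model. It is only a
sketch, not the md code. The assumptions are mine: a single bitmap chunk,
one count held per in-flight write, and a stock limit of 16383. The names
simulate, HACKED_COUNTER_MAX and ASSUMED_STOCK_MAX are made up for this
example.

```python
# Toy model, not the md implementation. Assumptions: one bitmap chunk,
# each in-flight write holds exactly one count, and a writer that finds
# the counter at its maximum has to sleep until another write completes.

HACKED_COUNTER_MAX = 4       # value used for this test
ASSUMED_STOCK_MAX = 16383    # assumption: 14 usable bits in a 16-bit counter


def simulate(writers, counter_max):
    """Return (admitted, sleeping) for `writers` simultaneous writes."""
    counter = 0
    sleeping = 0
    for _ in range(writers):
        if counter == counter_max:
            sleeping += 1    # would wait for the counter to drop
        else:
            counter += 1     # write is accounted and proceeds
    return counter, sleeping


if __name__ == "__main__":
    for limit in (HACKED_COUNTER_MAX, ASSUMED_STOCK_MAX):
        admitted, sleeping = simulate(100, limit)
        print(f"COUNTER_MAX={limit}: admitted={admitted} sleeping={sleeping}")
```

In this model, with numjobs=100 the hacked limit leaves most writers
waiting, while the stock-sized limit is never reached.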

>
> I will try to look into this next week (taking some time off this week).

Thanks,
Tianci