Re: [PATCH v2 3/3] md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d

Song Liu <song@xxxxxxxxxx> · Mon, 19 Sep 2022 11:31:40 -0700

On Thu, Sep 15, 2022 at 8:15 AM Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
>
>
> Hi Song,
>
> On 2022-09-08 10:15, Logan Gunthorpe wrote:
> > A complicated deadlock exists when using the journal and an elevated
> > group_thrtead_cnt. It was found with loop devices, but its not clear
> > whether it can be seen with real disks. The deadlock can occur simply
> > by writing data with an fio script.
> >
> > When the deadlock occurs, multiple threads will hang in different ways:
> >
> >  1) The group threads will hang in the blk-wbt code with bios waiting to
> >     be submitted to the block layer:
> >
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         __submit_bio+0xe6/0x100
> >         submit_bio_noacct_nocheck+0x42e/0x470
> >         submit_bio_noacct+0x4c2/0xbb0
> >         ops_run_io+0x46b/0x1a30
> >         handle_stripe+0xcd3/0x36b0
> >         handle_active_stripes.constprop.0+0x6f6/0xa60
> >         raid5_do_work+0x177/0x330
> >
> >     Or:
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         __submit_bio+0xe6/0x100
> >         submit_bio_noacct_nocheck+0x42e/0x470
> >         submit_bio_noacct+0x4c2/0xbb0
> >         flush_deferred_bios+0x136/0x170
> >         raid5_do_work+0x262/0x330
> >
> >  2) The r5l_reclaim thread will hang in the same way, submitting a
> >     bio to the block layer:
> >
> >         io_schedule+0x70/0xb0
> >         rq_qos_wait+0x153/0x210
> >         wbt_wait+0x115/0x1b0
> >         __rq_qos_throttle+0x38/0x60
> >         blk_mq_submit_bio+0x589/0xcd0
> >         __submit_bio+0xe6/0x100
> >         submit_bio_noacct_nocheck+0x42e/0x470
> >         submit_bio_noacct+0x4c2/0xbb0
> >         submit_bio+0x3f/0xf0
> >         md_super_write+0x12f/0x1b0
> >         md_update_sb.part.0+0x7c6/0xff0
> >         md_update_sb+0x30/0x60
> >         r5l_do_reclaim+0x4f9/0x5e0
> >         r5l_reclaim_thread+0x69/0x30b
> >
> >     However, before hanging, the MD_SB_CHANGE_PENDING flag will be
> >     set for sb_flags in r5l_write_super_and_discard_space(). This
> >     flag will never be cleared because the submit_bio() call never
> >     returns.
> >
> >  3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
> >     will do no processing on any pending stripes and re-set
> >     STRIPE_HANDLE. This will cause the raid5d thread to enter an
> >     infinite loop, constantly trying to handle the same stripes
> >     stuck in the queue.
> >
> >     The raid5d thread has a blk_plug that holds a number of bios
> >     that are also stuck waiting seeing the thread is in a loop
> >     that never schedules. These bios have been accounted for by
> >     blk-wbt thus preventing the other threads above from
> >     continuing when they try to submit bios. --Deadlock.
> >
> > To fix this, add the same wait_event() that is used in raid5_do_work()
> > to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
> > schedule and wait until the flag is cleared. The schedule action will
> > flush the plug which will allow the r5l_reclaim thread to continue,
> > thus preventing the deadlock.
> >
> > It's not clear when the deadlock was introduced, but the similar
> > wait_event() call in raid5_do_work() was added in 2017 by this
> > commit:
> >
> >     16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
> >                    is updated.")
> >
> > Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@xxxxxxxxxxxx
> > Signed-off-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>
>
> Seems this patch of mine that is in md-next is causing an 1 in ~30
> failure on the mdadm test 13imsm-r0_r5_3d-grow-r0_r5_4d.
>
> I'm looking into it and will try to send an updated patch when I have a fix.

Thanks for the heads-up! I will remove this one from md-next for now.

Song