Patch "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" has been added to the 5.15-stable tree

This is a note to let you know that I've just added the patch titled

    md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d

to the 5.15-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     md-raid5-wait-for-md_sb_change_pending-in-raid5d.patch
and it can be found in the queue-5.15 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit a95a08cf6ada7594444ca7b01c710bfb01065623
Author: Logan Gunthorpe <logang@xxxxxxxxxxxx>
Date:   Wed Sep 21 10:28:37 2022 -0600

    md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d
    
    [ Upstream commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74 ]
    
    A complicated deadlock exists when using the journal and an elevated
    group_thread_cnt. It was found with loop devices, but it's not clear
    whether it can be seen with real disks. The deadlock can occur simply
    by writing data with an fio script.
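    A hypothetical fio job along these lines could drive such a write
    workload (the exact job used in the original report is not given;
    the device path, sizes, and thread counts below are illustrative,
    and assume an md array built on loop devices with a write journal
    and an elevated group_thread_cnt):

```
; illustrative fio job file -- not the one from the original report
[global]
ioengine=libaio
direct=1
rw=write
bs=1M
iodepth=16
time_based
runtime=60

[writer]
filename=/dev/md0   ; assumed: raid5 array with journal, group_thread_cnt > 0
numjobs=4
size=1G
```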
    
    When the deadlock occurs, multiple threads will hang in different ways:
    
     1) The group threads will hang in the blk-wbt code with bios waiting to
        be submitted to the block layer:
    
            io_schedule+0x70/0xb0
            rq_qos_wait+0x153/0x210
            wbt_wait+0x115/0x1b0
            __rq_qos_throttle+0x38/0x60
            blk_mq_submit_bio+0x589/0xcd0
            __submit_bio+0xe6/0x100
            submit_bio_noacct_nocheck+0x42e/0x470
            submit_bio_noacct+0x4c2/0xbb0
            ops_run_io+0x46b/0x1a30
            handle_stripe+0xcd3/0x36b0
            handle_active_stripes.constprop.0+0x6f6/0xa60
            raid5_do_work+0x177/0x330
    
        Or:
            io_schedule+0x70/0xb0
            rq_qos_wait+0x153/0x210
            wbt_wait+0x115/0x1b0
            __rq_qos_throttle+0x38/0x60
            blk_mq_submit_bio+0x589/0xcd0
            __submit_bio+0xe6/0x100
            submit_bio_noacct_nocheck+0x42e/0x470
            submit_bio_noacct+0x4c2/0xbb0
            flush_deferred_bios+0x136/0x170
            raid5_do_work+0x262/0x330
    
     2) The r5l_reclaim thread will hang in the same way, submitting a
        bio to the block layer:
    
            io_schedule+0x70/0xb0
            rq_qos_wait+0x153/0x210
            wbt_wait+0x115/0x1b0
            __rq_qos_throttle+0x38/0x60
            blk_mq_submit_bio+0x589/0xcd0
            __submit_bio+0xe6/0x100
            submit_bio_noacct_nocheck+0x42e/0x470
            submit_bio_noacct+0x4c2/0xbb0
            submit_bio+0x3f/0xf0
            md_super_write+0x12f/0x1b0
            md_update_sb.part.0+0x7c6/0xff0
            md_update_sb+0x30/0x60
            r5l_do_reclaim+0x4f9/0x5e0
            r5l_reclaim_thread+0x69/0x30b
    
        However, before hanging, the MD_SB_CHANGE_PENDING flag will be
        set for sb_flags in r5l_write_super_and_discard_space(). This
        flag will never be cleared because the submit_bio() call never
        returns.
    
     3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
        will do no processing on any pending stripes and re-set
        STRIPE_HANDLE. This will cause the raid5d thread to enter an
        infinite loop, constantly trying to handle the same stripes
        stuck in the queue.
    
        The raid5d thread has a blk_plug that holds a number of bios
        that are also stuck waiting, because the thread is in a loop
        that never schedules. Those bios have been accounted for by
        blk-wbt, preventing the other threads above from continuing
        when they try to submit bios: a deadlock.
    
    To fix this, add the same wait_event() that is used in raid5_do_work()
    to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
    schedule and wait until the flag is cleared. The schedule action will
    flush the plug which will allow the r5l_reclaim thread to continue,
    thus preventing the deadlock.
    
    However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
    from the same thread and can thus deadlock if the thread is put to
    sleep. So avoid waiting if md_check_recovery() is being called in the
    loop.
    
    It's not clear when the deadlock was introduced, but the similar
    wait_event() call in raid5_do_work() was added in 2017 by this
    commit:
    
        16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
                       is updated.")
    
    Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@xxxxxxxxxxxx
    Signed-off-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>
    Signed-off-by: Song Liu <song@xxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d9ad66c2c8e1..7a849a4b7085 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -36,6 +36,7 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/async_tx.h>
@@ -6518,7 +6519,18 @@ static void raid5d(struct md_thread *thread)
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
+
+			/*
+			 * Waiting on MD_SB_CHANGE_PENDING below may deadlock
+			 * seeing md_check_recovery() is needed to clear
+			 * the flag when using mdmon.
+			 */
+			continue;
 		}
+
+		wait_event_lock_irq(mddev->sb_wait,
+			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
+			conf->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);
 


