Re: [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

在 2024/05/20 10:55, Yu Kuai 写道:
Hi, Changhui

在 2024/05/20 8:39, Changhui Zhong 写道:
[czhong@vm linux-block]$ git bisect bad
060406c61c7cb4bbd82a02d179decca9c9bb3443 is the first bad commit
commit 060406c61c7cb4bbd82a02d179decca9c9bb3443
Author: Yu Kuai<yukuai3@xxxxxxxxxx>
Date:   Thu May 9 20:38:25 2024 +0800

     block: add plug while submitting IO

     So that if caller didn't use plug, for example, __blkdev_direct_IO_simple()      and __blkdev_direct_IO_async(), block layer can still benefit from caching
     nsec time in the plug.

     Signed-off-by: Yu Kuai<yukuai3@xxxxxxxxxx>
Link:https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@xxxxxxxxxxxxxxx
     Signed-off-by: Jens Axboe<axboe@xxxxxxxxx>

  block/blk-core.c | 6 ++++++
  1 file changed, 6 insertions(+)

Thanks for the test!

I was surprised to see this blamed commit, and after taking a look at
raid1 barrier code, I found that there are some known problems, fixed in
raid10, while raid1 still unfixed. So I wonder this patch maybe just
making the exist problem easier to reporduce.

I'll start cooking patches to sync raid10 fixes to raid1, meanwhile,
can you change your script to test raid10 as well, if raid10 is fine,
I'll give you these patches later to test raid1.

Hi,

Sorry to ask, but since I can't reporduce the problem, and based on
code reiview, there are multiple potential problems, can you also
reporduce the problem with following debug patch(just add some debug
info, no functional changes). So that I can make sure of details of
the problem.

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 113135e7b5f2..b35b847a9e8b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -936,6 +936,45 @@ static void flush_pending_writes(struct r1conf *conf)
                spin_unlock_irq(&conf->device_lock);
 }

+static bool waiting_barrier(struct r1conf *conf, int idx)
+{
+       int nr = atomic_read(&conf->nr_waiting[idx]);
+
+       if (nr) {
+               printk("%s: idx %d nr_waiting %d\n", __func__, idx, nr);
+               return true;
+       }
+
+       return false;
+}
+
+static bool waiting_pending(struct r1conf *conf, int idx)
+{
+       int nr;
+
+       if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery))
+               return false;
+
+       if (conf->array_frozen) {
+               printk("%s: array is frozen\n", __func__);
+               return true;
+       }
+
+       nr = atomic_read(&conf->nr_pending[idx]);
+       if (nr) {
+               printk("%s: idx %d nr_pending %d\n", __func__, idx, nr);
+               return true;
+       }
+
+       nr = atomic_read(&conf->barrier[idx]);
+       if (nr >= RESYNC_DEPTH) {
+ printk("%s: idx %d barrier %d exceeds %d\n", __func__, idx, nr, RESYNC_DEPTH);
+               return true;
+       }
+
+       return false;
+}
+
 /* Barriers....
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -967,8 +1006,7 @@ static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
        spin_lock_irq(&conf->resync_lock);

        /* Wait until no block IO is waiting */
-       wait_event_lock_irq(conf->wait_barrier,
-                           !atomic_read(&conf->nr_waiting[idx]),
+       wait_event_lock_irq(conf->wait_barrier, !waiting_barrier(conf, idx),
                            conf->resync_lock);

        /* block any new IO from starting */
@@ -990,11 +1028,7 @@ static int raise_barrier(struct r1conf *conf, sector_t sector_nr)
         * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning reaches
         *    max resync count which allowed on current I/O barrier bucket.
         */
-       wait_event_lock_irq(conf->wait_barrier,
-                           (!conf->array_frozen &&
-                            !atomic_read(&conf->nr_pending[idx]) &&
- atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH) || - test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery),
+       wait_event_lock_irq(conf->wait_barrier, !waiting_pending(conf, idx),
                            conf->resync_lock);

        if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {

Thanks,
Kuai


Thanks,
Kuai

.






[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux