Re: [PATCH] md/raid5: fix a potential deadlock of raid5 reshape

Guoqing Jiang wrote on 2017-12-28 15:59:
On 12/27/2017 01:47 PM, bingjingc wrote:

There is a potential deadlock if mount/umount happens while
raid5_finish_reshape() tries to grow the size of the emulated disk.

How does the deadlock happen?
1) The raid5 resync thread finishes the reshape (expanding the array).
2) The mount or umount thread holds the VFS sb->s_umount lock and tries
   to write critical data through to the raid5 emulated block device,
   so it waits for the raid5 kernel thread to handle stripes in order
   to finish its I/Os.
3) In the raid5 kernel thread's routine, md_check_recovery() is called
   first in order to reap the raid5 resync thread; that is,
   raid5_finish_reshape() is called. This function tries to update conf
   and calls the VFS revalidate_disk() to grow the raid5 emulated block
   device, which in turn tries to acquire the VFS sb->s_umount lock.
The raid5 kernel thread cannot continue, so no one can handle the
mount/umount I/Os (stripes). Since the write-through I/Os can never
finish, mount/umount never releases sb->s_umount. The deadlock happens.

The raid5 kernel thread serves an emulated block device. It is
responsible for handling I/Os (stripes) from upper layers. The
emulated block device should not request any I/Os of itself; that is,
it should not call VFS layer functions. (If it did, it would try to
acquire VFS locks to guarantee the I/O ordering.) That is why the
resync thread is the one that sends resync I/O requests and waits for
the results.
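
For illustration only (not part of the patch), here is a minimal
userspace sketch of the circular wait, using pthread stand-ins for
sb->s_umount and the stripe completion; all names here are made up for
the demo:

/*
 * Thread A (the mount/umount analogue) holds "s_umount" and waits for
 * its I/O to complete; thread B (the raid5d analogue) must take
 * "s_umount" before it can get back to completing the I/O, so neither
 * makes progress. Build with: cc demo.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t s_umount = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t io_cv = PTHREAD_COND_INITIALIZER;
static int io_done;

/* mount/umount analogue: holds s_umount, then waits for its I/O */
static void *mount_path(void *arg)
{
	pthread_mutex_lock(&s_umount);
	pthread_mutex_lock(&io_lock);
	while (!io_done)			/* waits forever ... */
		pthread_cond_wait(&io_cv, &io_lock);
	pthread_mutex_unlock(&io_lock);
	pthread_mutex_unlock(&s_umount);
	return NULL;
}

/* raid5d analogue: revalidate_disk() needs s_umount before the
 * thread can get back to handling stripes and completing the I/O */
static void *raid5d_path(void *arg)
{
	pthread_mutex_lock(&s_umount);		/* ... because this blocks */
	pthread_mutex_unlock(&s_umount);

	pthread_mutex_lock(&io_lock);
	io_done = 1;
	pthread_cond_signal(&io_cv);
	pthread_mutex_unlock(&io_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, mount_path, NULL);
	sleep(1);		/* let the mount path take s_umount first */
	pthread_create(&b, NULL, raid5d_path, NULL);
	pthread_join(a, NULL);	/* never returns: deadlock */
	pthread_join(b, NULL);
	return 0;
}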

To solve this potential deadlock, we make the size growth of the
emulated block device the final step of the reshape thread.

Maybe another way to avoid the deadlock is to ensure that a reshaping
array can't be mounted at all. One reason is that even when the reshape
is finished, the fs still doesn't know about the new array size since
the metadata is not refreshed. Just my $0.02.


It may take hours or even a couple of days to expand an array, so I bet
it's a feature that an expanding array can still serve fs requests.
Once the expansion is finished, the new array size in the metadata can
be passed to the fs layer via revalidate_disk(). From then on, the fs
can be expanded too; there is no need to restart the array. Based on
this point, I don't want to break this feature.
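
For example (a hypothetical check; /dev/md0 is just an example path),
userspace can observe the grown capacity right after the reshape
finishes, without restarting the array:

#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	uint64_t bytes;
	int fd = open("/dev/md0", O_RDONLY);	/* example device */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* BLKGETSIZE64 reports the capacity set by set_capacity() /
	 * revalidate_disk() in the kernel */
	if (ioctl(fd, BLKGETSIZE64, &bytes) < 0) {
		perror("BLKGETSIZE64");
		close(fd);
		return 1;
	}
	printf("array size: %llu bytes\n", (unsigned long long)bytes);
	close(fd);
	return 0;
}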


Reported-by: Alex Wu <alexwu@xxxxxxxxxxxx>
Reviewed-by: Alex Wu <alexwu@xxxxxxxxxxxx>
Reviewed-by: Chung-Chiang Cheng <cccheng@xxxxxxxxxxxx>
Signed-off-by: BingJing Chang <bingjingc@xxxxxxxxxxxx>
---
 drivers/md/md.c    | 13 +++++++++++++
 drivers/md/raid5.c |  8 +-------
 2 files changed, 14 insertions(+), 7 deletions(-)

I think raid10 has a similar issue, so you need to change raid10 as
well.

Thank you for the notice. It should be a similar issue. Let me first
verify whether it is really reproducible in raid10, and then we will
provide a fix for it.

Thanks,
BingJing


Thanks,
Guoqing


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4e4dee0..77d6d85 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8527,6 +8527,19 @@ void md_do_sync(struct md_thread *thread)
      * raid */
     set_mask_bits(&mddev->sb_flags, 0,
               BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
+
+    if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+            !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
+            mddev->delta_disks > 0 &&
+            mddev->pers->finish_reshape &&
+            mddev->pers->size &&
+            mddev->queue) {
+        mddev_lock_nointr(mddev);
+        md_set_array_sectors(mddev, mddev->pers->size(mddev, 0, 0));
+        mddev_unlock(mddev);
+        set_capacity(mddev->gendisk, mddev->array_sectors);
+        revalidate_disk(mddev->gendisk);
+    }

     spin_lock(&mddev->lock);
     if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 98ce427..0f68faa 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8001,13 +8001,7 @@ static void raid5_finish_reshape(struct mddev *mddev)

     if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {

-        if (mddev->delta_disks > 0) {
-            md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
-            if (mddev->queue) {
-                set_capacity(mddev->gendisk, mddev->array_sectors);
-                revalidate_disk(mddev->gendisk);
-            }
-        } else {
+        if (mddev->delta_disks <= 0) {
             int d;
             spin_lock_irq(&conf->device_lock);
             mddev->degraded = raid5_calc_degraded(conf);



