On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
> Thanks Neil,
>
> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>
>>> As you suggested, I re-ran the same test on 4.4.36 without any of our own MD patches.
>>> I can still reproduce the same bug; nr_pending on the healthy leg (loop1) is still 1.
>>
>> Thanks.
>>
>> I have an hypothesis.
>>
>> md_make_request() calls blk_queue_split().
>> If that does split the request, it will call generic_make_request()
>> on the first half.  That will call back into md_make_request() and
>> raid1_make_request(), which will submit requests to the underlying
>> devices.  These will get caught on the bio_list_on_stack queue in
>> generic_make_request().
>> This is a queue which is not accounted for in nr_queued.
>>
>> When blk_queue_split() completes, 'bio' will be the second half of
>> the original bio.
>> This enters raid1_make_request(), and by this time the array has been
>> frozen.
>> So wait_barrier() has to wait for pending requests to complete, and that
>> includes the one that is stuck in bio_list_on_stack, which will never
>> complete now.
>>
>> To see if this might be happening, please change the
>>
>>     blk_queue_split(q, &bio, q->bio_split);
>>
>> call in md_make_request() to
>>
>>     struct bio *tmp = bio;
>>     blk_queue_split(q, &bio, q->bio_split);
>>     WARN_ON_ONCE(bio != tmp);
>>
>> If that ever triggers, then the above is a real possibility.
>
> I triggered the warning as you expected, so we can confirm the bug is
> caused by your hypothesis above:
>
> [ 429.282235] ------------[ cut here ]------------
> [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]()
> [ 429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm_infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmouse xhci_pci xhci_hcd [last unloaded: mlx_compat]
> [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pserver #1
> [ 429.288825] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
> [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 0000000000000000
> [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 ffff8800b8f3e000
> [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 ffff880234f9a700
> [ 429.290594] Call Trace:
> [ 429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
> [ 429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
> [ 429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
> [ 429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
> [ 429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
> [ 429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
> [ 429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
> [ 429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
> [ 429.292011]  [<ffffffffa0834f64>] ? ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
> [ 429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
> [ 429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
> [ 429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
> [ 429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
> [ 429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
> [ 429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
> [ 429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
> [ 429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
> [ 429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
> [ 429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
> [ 429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
> [ 429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
> [ 429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [ 429.294610] ---[ end trace 25d1cece0e01494b ]---
>
> I double checked: nr_pending on the healthy leg is still 1, as before.
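
(Side note, to make the trap easier to see: below is my rough paraphrase of
generic_make_request() as it looks in 4.4, simplified and from memory, so the
details may be off.  Once a make_request_fn is active on a task, any bio
submitted recursively is only parked on the on-stack list and is not dispatched
until the outer ->make_request_fn returns:

    blk_qc_t generic_make_request(struct bio *bio)
    {
    	struct bio_list bio_list_on_stack;
    	blk_qc_t ret = BLK_QC_T_NONE;

    	if (current->bio_list) {
    		/* A make_request_fn is already running on this task:
    		 * just park the bio; the loop below only picks it up
    		 * after the currently running ->make_request_fn returns. */
    		bio_list_add(current->bio_list, bio);
    		return ret;
    	}

    	bio_list_init(&bio_list_on_stack);
    	current->bio_list = &bio_list_on_stack;
    	do {
    		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

    		ret = q->make_request_fn(q, bio);	/* e.g. md_make_request() */

    		bio = bio_list_pop(current->bio_list);
    	} while (bio);
    	current->bio_list = NULL;

    	return ret;
    }

So the bios that raid1_make_request() submitted to the member disks for the
first half of the split are parked on bio_list_on_stack; they cannot be
dispatched, and nr_pending cannot drop, until the outer md_make_request()
returns — which it never does once wait_barrier() blocks for the second half
on the frozen array.)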
>>
>> Fixing the problem isn't very easy...
>>
>> You could try:
>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>>    (which you will need to export from block/bio.c),
>>    passing mddev->queue->bio_split as the bio_set.
>>
>> 2/ change the wait_event_lock_irq() call in wait_barrier() to
>>    wait_event_lock_irq_cmd(), and pass the new function as the command.
>>
>> That way, if wait_barrier() ever blocks, all the requests in
>> bio_list_on_stack will be handled by a separate thread.
>>
>> NeilBrown
>
> I will try your suggested way to see if it fixes the bug; I will report back soon.

Hi Neil,

Sorry, bad news: with the two patches attached, I can still reproduce the
same bug.  nr_pending on the healthy leg is still 1, as before.

crash> struct r1conf 0xffff8800b7176100
struct r1conf {
  mddev = 0xffff8800b59b0000,
  mirrors = 0xffff88022bab7900,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  retry_list = { next = 0xffff880211b2ec40, prev = 0xffff88022819ad40 },
  bio_end_io_list = { next = 0xffff880227e9a9c0, prev = 0xffff8802119c6140 },
  pending_bio_list = { head = 0x0, tail = 0x0 },
  pending_count = 0,
  wait_barrier = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = { next = 0xffff8800adf3b818, prev = 0xffff88021180f7a8 }
  },
  resync_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  nr_pending = 1675,
  nr_waiting = 100,
  nr_queued = 1673,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022c80f640,
  r1bio_pool = 0xffff88022b8b6a20,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008a90c80,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

  kobj = {
    name = 0xffff88022b7194a0 "dev-loop1",
    entry = { next = 0xffff880231495280, prev = 0xffff880231495280 },
    parent = 0xffff8800b59b0050,
    kset = 0x0,
    ktype = 0xffffffffa0564060 <rdev_ktype>,
    sd = 0xffff8800b6510960,
    kref = { refcount = { counter = 1 } },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = { next = 0xffff8802314952c8, prev = 0xffff8802314952c8 }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 0, journal_tail = 0 },
  nr_pending = { counter = 1 },
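
The second block above I take to be from the struct md_rdev of dev-loop1 (the
healthy leg), where nr_pending.counter is still 1.  If I read the 4.4
freeze_array() correctly (quoted from memory, so the exact wording may differ),
it can only finish once every pending request is either completed or queued
for retry:

    static void freeze_array(struct r1conf *conf, int extra)
    {
    	spin_lock_irq(&conf->resync_lock);
    	conf->array_frozen = 1;
    	/* wait until every pending request is accounted for as queued */
    	wait_event_lock_irq_cmd(conf->wait_barrier,
    				conf->nr_pending == conf->nr_queued + extra,
    				conf->resync_lock,
    				flush_pending_writes(conf));
    	spin_unlock_irq(&conf->resync_lock);
    }

With nr_pending = 1675 and nr_queued = 1673, and assuming extra == 1 here (as
handle_read_error() passes), the condition needs nr_pending to reach 1674, so
exactly one request is neither completing nor being queued for retry — which
matches a bio parked on bio_list_on_stack.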

--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@xxxxxxxxxxxxxxxx
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
From e7adbbb1a8d542ea68ada5996e0f9ffe87c479b6 Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 11:26:23 +0100
Subject: [PATCH 1/2] block: export punt_bios_to_rescuer

We need it later in raid1.

Suggested-by: Neil Brown <neil.brown@xxxxxxxx>
Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
---
 block/bio.c         | 3 ++-
 include/linux/bio.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 46e2cc1..f6a250d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,7 +354,7 @@ static void bio_alloc_rescue(struct work_struct *work)
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+void punt_bios_to_rescuer(struct bio_set *bs)
 {
 	struct bio_list punt, nopunt;
 	struct bio *bio;
@@ -384,6 +384,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 
 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
 }
+EXPORT_SYMBOL(punt_bios_to_rescuer);
 
 /**
  * bio_alloc_bioset - allocate a bio for I/O
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 42e4e3c..6256ba7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -479,6 +479,7 @@ extern void bio_advance(struct bio *, unsigned);
 extern void bio_init(struct bio *);
 extern void bio_reset(struct bio *);
 void bio_chain(struct bio *, struct bio *);
+void punt_bios_to_rescuer(struct bio_set *);
 
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
--
2.7.4
From 2ad4cc5e8b5d7ec9db7a6fffaa2fdcd5f20419bf Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2016 11:35:52 +0100
Subject: [PATCH 2/2] raid1: fix deadlock

Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
---
 drivers/md/raid1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 478223c..61dafb1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -190,6 +190,11 @@ static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio)
 	}
 }
 
+static void raid1_punt_bios_to_rescuer(struct r1conf *conf)
+{
+	punt_bios_to_rescuer(conf->mddev->queue->bio_split);
+}
+
 static void free_r1bio(struct r1bio *r1_bio)
 {
 	struct r1conf *conf = r1_bio->mddev->private;
@@ -871,14 +876,15 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 		 * that queue to allow conf->start_next_window
 		 * to increase.
 		 */
-		wait_event_lock_irq(conf->wait_barrier,
+		wait_event_lock_irq_cmd(conf->wait_barrier,
 				    !conf->array_frozen &&
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
 				      current->bio_list &&
 				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
+				    conf->resync_lock,
+				    raid1_punt_bios_to_rescuer(conf));
 		conf->nr_waiting--;
 	}
 
--
2.7.4