Also known as "kernel BUG at drivers/md/raid5.c:527!". I'm still struggling with this issue in raid5.c. I currently see about one crash per week on average on a system running Lustre with 24 x 10-disk raid6 arrays per server. It has happened both with and without a check running.

I now think there is a race somewhere: a released (batched) stripe head is added to conf->inactive_list without sh->batch_head being set back to NULL. Because sh->batch_head is only set back to NULL in a single function, break_stripe_batch_list(), my guess is that a call to raid5_release_stripe() on the head stripe is performed somewhere else. I haven't found where this could come from yet.

So perhaps sh->batch_head should simply be set back to NULL in do_release_stripe(), before the stripe is added to the inactive list? Alternatively, I wonder if it would make more sense to clear batch_head wherever sh->overwrite_disks is also set back to 0 (as ops_run_biodrain() does today). Because sh->overwrite_disks is also set to 0 in init_stripe(), my current plan is to try the following change, at least as a workaround: remove the BUG_ON(sh->batch_head) from init_stripe() and set sh->batch_head = NULL there instead (I don't think I need to take sh->stripe_lock for that?). A rough sketch of what I have in mind follows the process listing below. I will get new hardware soon, so I will give it a try on a test system. Any suggestions welcome…

Quick analysis of the last vmcore:

crash> ps | grep ">"
...
>  16119  16111  22  ffff881f93bace70  RU   0.0  150772  10748  e2label
>  151448      2  34  ffff881ff753bec0  RU   0.0       0      0  [kiblnd_sd_00_00]
>  151449      2   2  ffff881ff753edd0  RU   0.0       0      0  [kiblnd_sd_00_01]
>  154637      2  29  ffff881f776b1f60  RU   0.0       0      0  [md32_raid6]
>  163643      2  11  ffff8820d200bec0  UN   0.0       0      0  [ll_ost_io01_008]
>  164083      2  13  ffff8820d2753ec0  UN   0.0       0      0  [ll_ost_io01_034]
>  171389      2   5  ffff883fd072de20  UN   0.0       0      0  [ll_ost_io01_058]
>  171391      2  17  ffff883fd072ce70  UN   0.0       0      0  [ll_ost_io01_060]
>  171399      2  33  ffff883fd9e9ce70  UN   0.0       0      0  [ll_ost_io01_068]
>  237019      2  15  ffff882715628fb0  UN   0.0       0      0  [ll_ost_io01_084]
>  237033      2   3  ffff881fb0250fb0  UN   0.0       0      0  [ll_ost_io01_098]
...
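For reference, here is a minimal sketch of the workaround I mentioned above, written against the init_stripe() code quoted in my first mail below. The WARN_ON_ONCE and the comment are my additions, and this is untested; it only papers over the BUG_ON, it is not a fix for the underlying race:

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 	BUG_ON(stripe_operations_active(sh));
-	BUG_ON(sh->batch_head);
+	/*
+	 * Workaround: a batched stripe head can apparently reach the
+	 * inactive list with batch_head still set.  Warn and clear it
+	 * instead of crashing.  sh->count is 0 and the stripe is not on
+	 * any list at this point, so I believe no locking is needed
+	 * (to be confirmed).
+	 */
+	WARN_ON_ONCE(sh->batch_head);
+	sh->batch_head = NULL;

The do_release_stripe() variant would be similar: clear sh->batch_head just before the stripe head is moved to the inactive (or temp_inactive) list. I have not written that one down yet.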
crash> bt 237033
PID: 237033  TASK: ffff881fb0250fb0  CPU: 3   COMMAND: "ll_ost_io01_098"
 #0 [ffff8826fa3e7240] machine_kexec at ffffffff81059bdb
 #1 [ffff8826fa3e72a0] __crash_kexec at ffffffff811057c2
 #2 [ffff8826fa3e7370] crash_kexec at ffffffff811058b0
 #3 [ffff8826fa3e7388] oops_end at ffffffff8168fac8
 #4 [ffff8826fa3e73b0] die at ffffffff8102e93b
 #5 [ffff8826fa3e73e0] do_trap at ffffffff8168f180
 #6 [ffff8826fa3e7430] do_invalid_op at ffffffff8102b144
 #7 [ffff8826fa3e74e0] invalid_op at ffffffff81698b5e
    [exception RIP: raid5_get_active_stripe+1809]
    RIP: ffffffffa0fefd71  RSP: ffff8826fa3e7590  RFLAGS: 00010086
    RAX: 0000000000000000  RBX: ffff883df051641c  RCX: dead000000000200
    RDX: 0000000000000000  RSI: 0000000000000005  RDI: ffff881e08d54d30
    RBP: ffff8826fa3e7638   R8: ffff881e08d54d40   R9: 0000000000000080
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff883df0516400
    R13: ffff881e08d54d30  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff8826fa3e7640] raid5_make_request at ffffffffa0feff55 [raid456]
 #9 [ffff8826fa3e7730] md_make_request at ffffffff814f838c
#10 [ffff8826fa3e7790] generic_make_request at ffffffff812eedc9
#11 [ffff8826fa3e77d8] submit_bio at ffffffff812eef11
#12 [ffff8826fa3e7830] osd_submit_bio at ffffffffa10cdbcc [osd_ldiskfs]
#13 [ffff8826fa3e7840] osd_do_bio at ffffffffa10d03c1 [osd_ldiskfs]
#14 [ffff8826fa3e7960] osd_write_commit at ffffffffa10d0b94 [osd_ldiskfs]
#15 [ffff8826fa3e7a00] ofd_commitrw_write at ffffffffa120c734 [ofd]
#16 [ffff8826fa3e7a78] ofd_commitrw at ffffffffa120ff07 [ofd]
#17 [ffff8826fa3e7b08] obd_commitrw at ffffffffa0c7dc22 [ptlrpc]
#18 [ffff8826fa3e7b70] tgt_brw_write at ffffffffa0c55fc1 [ptlrpc]
#19 [ffff8826fa3e7cd8] tgt_request_handle at ffffffffa0c52275 [ptlrpc]
#20 [ffff8826fa3e7d20] ptlrpc_server_handle_request at ffffffffa0bfe1fb [ptlrpc]
#21 [ffff8826fa3e7de8] ptlrpc_main at ffffffffa0c022b0 [ptlrpc]
#22 [ffff8826fa3e7ec8] kthread at ffffffff810b0a4f
#23 [ffff8826fa3e7f50] ret_from_fork at ffffffff81697218

After some disassembly work, I found out that all active tasks are working on the same md (all are raid6):

md32's (struct r5conf *) is 0xffff883df0516400

Our crashed task:

crash> printf "%s", ((struct r5conf)*0xffff883df0516400).cache_name[0]
raid6-md32

Example for the md32_raid6 task in release_inactive_stripe_list:

crash> bt 154637
PID: 154637  TASK: ffff881f776b1f60  CPU: 29  COMMAND: "md32_raid6"
 #0 [ffff883ffc985e48] crash_nmi_callback at ffffffff8104d2e2
 #1 [ffff883ffc985e58] nmi_handle at ffffffff8168fc77
 #2 [ffff883ffc985eb0] do_nmi at ffffffff8168fe83
 #3 [ffff883ffc985ef0] end_repeat_nmi at ffffffff8168f093
    [exception RIP: _raw_spin_lock_irqsave+64]
    RIP: ffffffff8168e5f0  RSP: ffff881e8d0cbc60  RFLAGS: 00000006
    RAX: 00000000000031bb  RBX: ffff883df0516740  RCX: 000000000000009a
    RDX: 0000000000000098  RSI: 000000000000009a  RDI: ffff883df051641c
    RBP: ffff881e8d0cbc60   R8: 0000000000000206   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000007  R12: ffff883df0516400
    R13: ffff883df051641c  R14: 0000000000000006  R15: 0000000000000005
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881e8d0cbc60] _raw_spin_lock_irqsave at ffffffff8168e5f0
 #5 [ffff881e8d0cbc68] release_inactive_stripe_list at ffffffffa0fe5500 [raid456]
 #6 [ffff881e8d0cbcb0] handle_active_stripes at ffffffffa0ff34c2 [raid456]
 #7 [ffff881e8d0cbd60] raid5d at ffffffffa0ff3c28 [raid456]
 #8 [ffff881e8d0cbe50] md_thread at ffffffff814fb745
 #9 [ffff881e8d0cbec8] kthread at ffffffff810b0a4f
#10 [ffff881e8d0cbf50] ret_from_fork at ffffffff81697218

crash> bt 163643 164083 171389 171391 171399 237019 237033 | grep "exception RIP:"
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: _raw_spin_lock_irq+58]
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: _raw_spin_lock_irq+58]
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: raid5_get_active_stripe+1809]

crash> bt 163643 164083 171389 171391 171399 237019 237033 | grep "raid5_get_active_stripe"
 #5 [ffff883bcee93588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff883ba4af7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff88390877b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff88390803b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff88390831f588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff8826fa7f7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
    [exception RIP: raid5_get_active_stripe+1809]

crash> bt 163643 164083 171389 171391 171399 237019 237033 | grep R12
    R10: 0000000000000003  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000005  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000007  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000009  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000009  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff883df0516400

The bad stripe_head (sh) is available at the address usually found in %r13:

crash> p ((struct stripe_head)*0xffff881e08d54d30).raid_conf
$98 = (struct r5conf *) 0xffff883df0516400

and batch_head is not NULL... that's the issue:

crash> p ((struct stripe_head)*0xffff881e08d54d30).batch_head
$99 = (struct stripe_head *) 0xffff881e08d54d30

Full struct dump below.
We can see that overwrite_disks is set to 0 and state is 0… but batch_head has not been set back to NULL, which makes it crash in init_stripe():

crash> p ((struct stripe_head)*0xffff881e08d54d30)
$6 = {
  hash = { next = 0x0, pprev = 0x0 },
  lru = { next = 0xffff881e08d54d40, prev = 0xffff881e08d54d40 },
  release_list = { next = 0xffff881dfe405cc0 },
  raid_conf = 0xffff883df0516400,
  generation = 0,
  sector = 3701224808,
  pd_idx = 1,
  qd_idx = 2,
  ddf_layout = 0,
  hash_lock_index = 5,
  state = 0,
  count = { counter = 0 },
  bm_seq = 22708787,
  disks = 10,
  overwrite_disks = 0,
  check_state = check_state_idle,
  reconstruct_state = reconstruct_state_idle,
  stripe_lock = { { rlock = { raw_lock = { { head_tail = 3709263126, tickets = { head = 56598, tail = 56598 } } } } } },
  cpu = 15,
  group = 0x0,
  batch_head = 0xffff881e08d54d30,
  batch_lock = { { rlock = { raw_lock = { { head_tail = 9961624, tickets = { head = 152, tail = 152 } } } } } },
  batch_list = { next = 0xffff881dfe405d28, prev = 0xffff881e0e2dec98 },
  log_io = 0x0,
  log_list = { next = 0x0, prev = 0x0 },
  ops = { target = 0, target2 = 0, zero_sum_result = (unknown: 0) },
  dev = {{
    req = { bi_sector = 0, bi_next = 0x0, bi_bdev = 0x0, bi_flags = 1, bi_rw = 0, bi_vcnt = 0, bi_idx = 0,
            bi_phys_segments = 0, bi_size = 0, bi_seg_front_size = 0, bi_seg_back_size = 0, bi_end_io = 0x0,
            bi_private = 0xffff881e08d54d30, bi_ioc = 0x0, bi_css = 0x0, bi_integrity = 0x0, bi_max_vecs = 1,
            bi_cnt = { counter = 1 }, bi_io_vec = 0xffff881e08d54f00, bi_pool = 0x0,
            { bio_aux = 0x0, __UNIQUE_ID_rh_kabi_hide15 = { rh_reserved1 = 0x0 }, {<No data fields>} },
            bi_inline_vecs = 0xffff881e08d54e78 },
    rreq = { bi_sector = 0, bi_next = 0x0, bi_bdev = 0x0, bi_flags = 1, bi_rw = 0, bi_vcnt = 0, bi_idx = 0,
             bi_phys_segments = 0, bi_size = 0, bi_seg_front_size = 0, bi_seg_back_size = 0, bi_end_io = 0x0,
             bi_private = 0xffff881e08d54d30, bi_ioc = 0x0, bi_css = 0x0, bi_integrity = 0x0, bi_max_vecs = 1,
             bi_cnt = { counter = 1 }, bi_io_vec = 0xffff881e08d54f10, bi_pool = 0x0,
             { bio_aux = 0x0, __UNIQUE_ID_rh_kabi_hide15 = { rh_reserved1 = 0x0 }, {<No data fields>} },
             bi_inline_vecs = 0xffff881e08d54f00 },
    vec = { bv_page = 0xffffea0078236080, bv_len = 4096, bv_offset = 0 },
    rvec = { bv_page = 0x0, bv_len = 0, bv_offset = 0 },
    page = 0xffffea0078236080,
    orig_page = 0xffffea0078236080,
    toread = 0x0,
    read = 0x0,
    towrite = 0x0,
    written = 0xffff8824832a7c00,
    sector = 29609798632,
    flags = 24,
    log_checksum = 0
  }}
}

Thanks,

Stephane

> On May 8, 2017, at 10:15 AM, Stephane Thiell <sthiell@xxxxxxxxxxxx> wrote:
> 
> Hi list,
> 
> I have a recurring issue using raid6 which results in a panic due to a BUG_ON(sh->batch_head). Last weekend, the issue occurred during a weekly raid-check. The raid volumes (12 total) are pretty new; no mismatches or hardware errors have been detected.
> 
> [535089.369357] kernel BUG at drivers/md/raid5.c:527!
> [535089.374700] invalid opcode: 0000 [#1] SMP
> [535089.379384] Modules linked in: ...
> [535089.503509] CPU: 34 PID: 280061 Comm: md0_resync Tainted: G OE ------------ 3.10.0-514.10.2.el7_lustre.x86_64 #1 > > This is the backtrace: > > crash> bt 280061 > PID: 280061 TASK: ffff8800757cde20 CPU: 34 COMMAND: "md0_resync" > #0 [ffff88024e217830] machine_kexec at ffffffff81059bdb > #1 [ffff88024e217890] __crash_kexec at ffffffff81105382 > #2 [ffff88024e217960] crash_kexec at ffffffff81105470 > #3 [ffff88024e217978] oops_end at ffffffff8168f508 > #4 [ffff88024e2179a0] die at ffffffff8102e93b > #5 [ffff88024e2179d0] do_trap at ffffffff8168ebc0 > #6 [ffff88024e217a20] do_invalid_op at ffffffff8102b144 > #7 [ffff88024e217ad0] invalid_op at ffffffff816984de > [exception RIP: raid5_get_active_stripe+1809] > RIP: ffffffffa0e4ed71 RSP: ffff88024e217b88 RFLAGS: 00010086 > RAX: 0000000000000000 RBX: ffff883fe5e40408 RCX: dead000000000200 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8823ebf45ca0 > RBP: ffff88024e217c30 R8: ffff8823ebf45cb0 R9: 0000000000000080 > R10: 0000000000000006 R11: 0000000000000000 R12: ffff883fe5e40400 > R13: ffff8823ebf45ca0 R14: 0000000000000000 R15: 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #8 [ffff88024e217b80] raid5_get_active_stripe at ffffffffa0e4e996 [raid456] > #9 [ffff88024e217be0] raid5_release_stripe at ffffffffa0e48f24 [raid456] > #10 [ffff88024e217c38] raid5_sync_request at ffffffffa0e53c4b [raid456] > #11 [ffff88024e217ca8] md_do_sync at ffffffff814fef9b > #12 [ffff88024e217e50] md_thread at ffffffff814fb1b5 > #13 [ffff88024e217ec8] kthread at ffffffff810b06ff > #14 [ffff88024e217f50] ret_from_fork at ffffffff81696b98 > > It appears to be triggered by BUG_ON(sh->batch_head) in init_stripe(): > > crash> l drivers/md/raid5.c:524 > 519 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous) > 520 { > 521 struct r5conf *conf = sh->raid_conf; > 522 int i, seq; > 523 > 524 BUG_ON(atomic_read(&sh->count) != 0); > 525 BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); > 526 BUG_ON(stripe_operations_active(sh)); > 527 BUG_ON(sh->batch_head); <<< > 528 > > Other I/Os were processed at this time, but I am not sure how to check that they were actually on the same md: > > crash> ps | grep ">" > <snip> >> 59684 2 28 ffff88407fdc0fb0 RU 0.0 0 0 [md0_raid6] >> 61479 2 17 ffff883e46e80000 UN 0.0 0 0 [ll_ost_io01_001] >> 220748 2 23 ffff881fb7ab8fb0 UN 0.0 0 0 [ll_ost_io01_011] >> 220750 2 19 ffff881fb7abce70 UN 0.0 0 0 [ll_ost_io01_013] >> 279158 2 14 ffff883ab46b4e70 RU 0.0 0 0 [md22_resync] >> 280061 2 34 ffff8800757cde20 RU 0.0 0 0 [md0_resync] >> 280829 2 6 ffff881c72296dd0 RU 0.0 0 0 [md6_resync] > > example of possible concurrent writing thread: > > crash> bt 61479 > PID: 61479 TASK: ffff883e46e80000 CPU: 17 COMMAND: "ll_ost_io01_001" > #0 [ffff883ffc805e58] crash_nmi_callback at ffffffff8104d2e2 > #1 [ffff883ffc805e68] nmi_handle at ffffffff8168f699 > #2 [ffff883ffc805eb0] do_nmi at ffffffff8168f813 > #3 [ffff883ffc805ef0] end_repeat_nmi at ffffffff8168ead3 > [exception RIP: _raw_spin_lock_irq+63] > RIP: ffffffff8168e09f RSP: ffff883e46e3f588 RFLAGS: 00000002 > RAX: 00000000000044c2 RBX: ffff883fe5e40408 RCX: 000000000000c464 > RDX: 000000000000c468 RSI: 000000000000c468 RDI: ffff883fe5e40408 > RBP: ffff883e46e3f588 R8: 0000000000000000 R9: 0000000000000080 > R10: 0000000000000002 R11: 0000000000000000 R12: ffff883fe5e40400 > R13: 0000000000000000 R14: ffff883fe0e61900 R15: 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > --- <NMI exception stack> --- > #4 [ffff883e46e3f588] _raw_spin_lock_irq 
at ffffffff8168e09f > #5 [ffff883e46e3f590] raid5_get_active_stripe at ffffffffa0e4e6cb [raid456] > #6 [ffff883e46e3f648] raid5_make_request at ffffffffa0e4ef55 [raid456] > #7 [ffff883e46e3f738] md_make_request at ffffffff814f7dfc > #8 [ffff883e46e3f798] generic_make_request at ffffffff812ee939 > #9 [ffff883e46e3f7e0] submit_bio at ffffffff812eea81 > #10 [ffff883e46e3f838] osd_submit_bio at ffffffffa10a0bcc [osd_ldiskfs] > #11 [ffff883e46e3f848] osd_do_bio at ffffffffa10a3007 [osd_ldiskfs] > #12 [ffff883e46e3f968] osd_write_commit at ffffffffa10a3b94 [osd_ldiskfs] > #13 [ffff883e46e3fa08] ofd_commitrw_write at ffffffffa1113774 [ofd] > #14 [ffff883e46e3fa80] ofd_commitrw at ffffffffa1116f2d [ofd] > #15 [ffff883e46e3fb08] obd_commitrw at ffffffffa0c43c22 [ptlrpc] > #16 [ffff883e46e3fb70] tgt_brw_write at ffffffffa0c1bfc1 [ptlrpc] > #17 [ffff883e46e3fcd8] tgt_request_handle at ffffffffa0c18275 [ptlrpc] > #18 [ffff883e46e3fd20] ptlrpc_server_handle_request at ffffffffa0bc41fb [ptlrpc] > #19 [ffff883e46e3fde8] ptlrpc_main at ffffffffa0bc82b0 [ptlrpc] > #20 [ffff883e46e3fec8] kthread at ffffffff810b06ff > #21 [ffff883e46e3ff50] ret_from_fork at ffffffff81696b98 > > > The same issue happened during a heavy IOR benchmark a few months ago and was described in https://jira.hpdd.intel.com/browse/LU-8917 (on a slightly older el7 kernel). > > I also found out that other users have described similar issues, like on this thread: https://lkml.org/lkml/2016/12/23/205 > > Now, I have a crash dump and I’m trying to understand why sh->batch_head could be set in init_stripe(), which is called by raid5_get_active_stripe() when __find_stripe() failed BUT get_free_stripe() succeeded. If sh->batch_head is set in that case, that means the idle stripe found had it set… > > Does someone have any idea of how to troubleshoot or solve this? > > Thanks! > > Stephan ��.n��������+%������w��{.n�����{����w��ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f