Also known as "kernel BUG at drivers/md/raid5.c:527!". I'm still struggling with this issue in raid5.c. I currently see about one crash per week on average on a system running Lustre with 24 x 10-disk raid6 arrays per server. It has happened both with and without a check running.

I now think there is a race somewhere: a released (batched) stripe head is added to conf->inactive_list without sh->batch_head being set back to NULL. Because sh->batch_head is only set back to NULL in a single function, break_stripe_batch_list(), my guess is that a call to raid5_release_stripe() on the head stripe is performed somewhere else. I haven't found where this could come from yet.

So perhaps sh->batch_head should simply be set back to NULL in do_release_stripe(), before the stripe is added to the inactive list? Alternatively, I wonder if it would make more sense to clear batch_head wherever sh->overwrite_disks is also set back to 0 (as ops_run_biodrain() does today). Because sh->overwrite_disks is also set to 0 in init_stripe(), my current plan is to try the following change, at least as a workaround: remove the BUG_ON(sh->batch_head) from init_stripe() and set sh->batch_head = NULL there instead (I don't think I need to take sh->stripe_lock for that?). A rough sketch of what I have in mind follows the process listing below. I will get new hardware soon, so I will give it a try on a test system. Any suggestions welcome…

Quick analysis of the last vmcore:

crash> ps | grep ">"
...
>  16119  16111  22  ffff881f93bace70  RU   0.0  150772  10748  e2label
>  151448      2  34  ffff881ff753bec0  RU   0.0       0      0  [kiblnd_sd_00_00]
>  151449      2   2  ffff881ff753edd0  RU   0.0       0      0  [kiblnd_sd_00_01]
>  154637      2  29  ffff881f776b1f60  RU   0.0       0      0  [md32_raid6]
>  163643      2  11  ffff8820d200bec0  UN   0.0       0      0  [ll_ost_io01_008]
>  164083      2  13  ffff8820d2753ec0  UN   0.0       0      0  [ll_ost_io01_034]
>  171389      2   5  ffff883fd072de20  UN   0.0       0      0  [ll_ost_io01_058]
>  171391      2  17  ffff883fd072ce70  UN   0.0       0      0  [ll_ost_io01_060]
>  171399      2  33  ffff883fd9e9ce70  UN   0.0       0      0  [ll_ost_io01_068]
>  237019      2  15  ffff882715628fb0  UN   0.0       0      0  [ll_ost_io01_084]
>  237033      2   3  ffff881fb0250fb0  UN   0.0       0      0  [ll_ost_io01_098]
...
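For reference, here is a minimal sketch of the workaround I mentioned above, written against the init_stripe() code quoted in my first mail below. The WARN_ON_ONCE and the comment are my additions, and this is untested; it only papers over the BUG_ON, it is not a fix for the underlying race:

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 	BUG_ON(stripe_operations_active(sh));
-	BUG_ON(sh->batch_head);
+	/*
+	 * Workaround: a batched stripe head can apparently reach the
+	 * inactive list with batch_head still set.  Warn and clear it
+	 * instead of crashing.  sh->count is 0 and the stripe is not on
+	 * any list at this point, so I believe no locking is needed
+	 * (to be confirmed).
+	 */
+	WARN_ON_ONCE(sh->batch_head);
+	sh->batch_head = NULL;

The do_release_stripe() variant would be similar: clear sh->batch_head just before the stripe head is moved to the inactive (or temp_inactive) list. I have not written that one down yet.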
crash> bt 237033
PID: 237033  TASK: ffff881fb0250fb0  CPU: 3   COMMAND: "ll_ost_io01_098"
 #0 [ffff8826fa3e7240] machine_kexec at ffffffff81059bdb
 #1 [ffff8826fa3e72a0] __crash_kexec at ffffffff811057c2
 #2 [ffff8826fa3e7370] crash_kexec at ffffffff811058b0
 #3 [ffff8826fa3e7388] oops_end at ffffffff8168fac8
 #4 [ffff8826fa3e73b0] die at ffffffff8102e93b
 #5 [ffff8826fa3e73e0] do_trap at ffffffff8168f180
 #6 [ffff8826fa3e7430] do_invalid_op at ffffffff8102b144
 #7 [ffff8826fa3e74e0] invalid_op at ffffffff81698b5e
    [exception RIP: raid5_get_active_stripe+1809]
    RIP: ffffffffa0fefd71  RSP: ffff8826fa3e7590  RFLAGS: 00010086
    RAX: 0000000000000000  RBX: ffff883df051641c  RCX: dead000000000200
    RDX: 0000000000000000  RSI: 0000000000000005  RDI: ffff881e08d54d30
    RBP: ffff8826fa3e7638   R8: ffff881e08d54d40   R9: 0000000000000080
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff883df0516400
    R13: ffff881e08d54d30  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff8826fa3e7640] raid5_make_request at ffffffffa0feff55 [raid456]
 #9 [ffff8826fa3e7730] md_make_request at ffffffff814f838c
#10 [ffff8826fa3e7790] generic_make_request at ffffffff812eedc9
#11 [ffff8826fa3e77d8] submit_bio at ffffffff812eef11
#12 [ffff8826fa3e7830] osd_submit_bio at ffffffffa10cdbcc [osd_ldiskfs]
#13 [ffff8826fa3e7840] osd_do_bio at ffffffffa10d03c1 [osd_ldiskfs]
#14 [ffff8826fa3e7960] osd_write_commit at ffffffffa10d0b94 [osd_ldiskfs]
#15 [ffff8826fa3e7a00] ofd_commitrw_write at ffffffffa120c734 [ofd]
#16 [ffff8826fa3e7a78] ofd_commitrw at ffffffffa120ff07 [ofd]
#17 [ffff8826fa3e7b08] obd_commitrw at ffffffffa0c7dc22 [ptlrpc]
#18 [ffff8826fa3e7b70] tgt_brw_write at ffffffffa0c55fc1 [ptlrpc]
#19 [ffff8826fa3e7cd8] tgt_request_handle at ffffffffa0c52275 [ptlrpc]
#20 [ffff8826fa3e7d20] ptlrpc_server_handle_request at ffffffffa0bfe1fb [ptlrpc]
#21 [ffff8826fa3e7de8] ptlrpc_main at ffffffffa0c022b0 [ptlrpc]
#22 [ffff8826fa3e7ec8] kthread at ffffffff810b0a4f
#23 [ffff8826fa3e7f50] ret_from_fork at ffffffff81697218

After some disassembly work, I found out that all active tasks are working on the same md (all are raid6):

md32's (struct r5conf *) is 0xffff883df0516400

Our crashed task:

crash> printf "%s", ((struct r5conf)*0xffff883df0516400).cache_name[0]
raid6-md32

Example for the md32_raid6 task in release_inactive_stripe_list:

crash> bt 154637
PID: 154637  TASK: ffff881f776b1f60  CPU: 29  COMMAND: "md32_raid6"
 #0 [ffff883ffc985e48] crash_nmi_callback at ffffffff8104d2e2
 #1 [ffff883ffc985e58] nmi_handle at ffffffff8168fc77
 #2 [ffff883ffc985eb0] do_nmi at ffffffff8168fe83
 #3 [ffff883ffc985ef0] end_repeat_nmi at ffffffff8168f093
    [exception RIP: _raw_spin_lock_irqsave+64]
    RIP: ffffffff8168e5f0  RSP: ffff881e8d0cbc60  RFLAGS: 00000006
    RAX: 00000000000031bb  RBX: ffff883df0516740  RCX: 000000000000009a
    RDX: 0000000000000098  RSI: 000000000000009a  RDI: ffff883df051641c
    RBP: ffff881e8d0cbc60   R8: 0000000000000206   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000007  R12: ffff883df0516400
    R13: ffff883df051641c  R14: 0000000000000006  R15: 0000000000000005
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881e8d0cbc60] _raw_spin_lock_irqsave at ffffffff8168e5f0
 #5 [ffff881e8d0cbc68] release_inactive_stripe_list at ffffffffa0fe5500 [raid456]
 #6 [ffff881e8d0cbcb0] handle_active_stripes at ffffffffa0ff34c2 [raid456]
 #7 [ffff881e8d0cbd60] raid5d at ffffffffa0ff3c28 [raid456]
 #8 [ffff881e8d0cbe50] md_thread at ffffffff814fb745
 #9 [ffff881e8d0cbec8] kthread at ffffffff810b0a4f
#10 [ffff881e8d0cbf50] ret_from_fork at ffffffff81697218

crash> bt 163643 164083 171389 171391 171399 237019 237033 | grep "exception RIP:"
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: _raw_spin_lock_irq+58]
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: _raw_spin_lock_irq+58]
    [exception RIP: _raw_spin_lock_irq+63]
    [exception RIP: raid5_get_active_stripe+1809]

crash> bt 163643 164083 171389 171391 171399 237019 237033 | grep "raid5_get_active_stripe"
 #5 [ffff883bcee93588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff883ba4af7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff88390877b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff88390803b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff88390831f588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
 #5 [ffff8826fa7f7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456]
    [exception RIP: raid5_get_active_stripe+1809]

crash> bt 163643 164083 171389 171391 171399 237019 237033 | grep R12
    R10: 0000000000000003  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000005  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000007  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000009  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000009  R11: 0000000000000000  R12: ffff883df0516400
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff883df0516400

The bad stripe_head (sh) is available at the address usually found in %r13:

crash> p ((struct stripe_head)*0xffff881e08d54d30).raid_conf
$98 = (struct r5conf *) 0xffff883df0516400

and batch_head is not NULL... that's the issue:

crash> p ((struct stripe_head)*0xffff881e08d54d30).batch_head
$99 = (struct stripe_head *) 0xffff881e08d54d30

Full struct dump below.
We can see that overwrite_disks is set to 0 and state is 0… but batch_head has not been set back to NULL, which makes it crash in init_stripe():

crash> p ((struct stripe_head)*0xffff881e08d54d30)
$6 = {
  hash = { next = 0x0, pprev = 0x0 },
  lru = { next = 0xffff881e08d54d40, prev = 0xffff881e08d54d40 },
  release_list = { next = 0xffff881dfe405cc0 },
  raid_conf = 0xffff883df0516400,
  generation = 0,
  sector = 3701224808,
  pd_idx = 1,
  qd_idx = 2,
  ddf_layout = 0,
  hash_lock_index = 5,
  state = 0,
  count = { counter = 0 },
  bm_seq = 22708787,
  disks = 10,
  overwrite_disks = 0,
  check_state = check_state_idle,
  reconstruct_state = reconstruct_state_idle,
  stripe_lock = { { rlock = { raw_lock = { { head_tail = 3709263126, tickets = { head = 56598, tail = 56598 } } } } } },
  cpu = 15,
  group = 0x0,
  batch_head = 0xffff881e08d54d30,
  batch_lock = { { rlock = { raw_lock = { { head_tail = 9961624, tickets = { head = 152, tail = 152 } } } } } },
  batch_list = { next = 0xffff881dfe405d28, prev = 0xffff881e0e2dec98 },
  log_io = 0x0,
  log_list = { next = 0x0, prev = 0x0 },
  ops = { target = 0, target2 = 0, zero_sum_result = (unknown: 0) },
  dev = {{
    req = { bi_sector = 0, bi_next = 0x0, bi_bdev = 0x0, bi_flags = 1, bi_rw = 0, bi_vcnt = 0, bi_idx = 0,
            bi_phys_segments = 0, bi_size = 0, bi_seg_front_size = 0, bi_seg_back_size = 0, bi_end_io = 0x0,
            bi_private = 0xffff881e08d54d30, bi_ioc = 0x0, bi_css = 0x0, bi_integrity = 0x0, bi_max_vecs = 1,
            bi_cnt = { counter = 1 }, bi_io_vec = 0xffff881e08d54f00, bi_pool = 0x0,
            { bio_aux = 0x0, __UNIQUE_ID_rh_kabi_hide15 = { rh_reserved1 = 0x0 }, {<No data fields>} },
            bi_inline_vecs = 0xffff881e08d54e78 },
    rreq = { bi_sector = 0, bi_next = 0x0, bi_bdev = 0x0, bi_flags = 1, bi_rw = 0, bi_vcnt = 0, bi_idx = 0,
             bi_phys_segments = 0, bi_size = 0, bi_seg_front_size = 0, bi_seg_back_size = 0, bi_end_io = 0x0,
             bi_private = 0xffff881e08d54d30, bi_ioc = 0x0, bi_css = 0x0, bi_integrity = 0x0, bi_max_vecs = 1,
             bi_cnt = { counter = 1 }, bi_io_vec = 0xffff881e08d54f10, bi_pool = 0x0,
             { bio_aux = 0x0, __UNIQUE_ID_rh_kabi_hide15 = { rh_reserved1 = 0x0 }, {<No data fields>} },
             bi_inline_vecs = 0xffff881e08d54f00 },
    vec = { bv_page = 0xffffea0078236080, bv_len = 4096, bv_offset = 0 },
    rvec = { bv_page = 0x0, bv_len = 0, bv_offset = 0 },
    page = 0xffffea0078236080,
    orig_page = 0xffffea0078236080,
    toread = 0x0,
    read = 0x0,
    towrite = 0x0,
    written = 0xffff8824832a7c00,
    sector = 29609798632,
    flags = 24,
    log_checksum = 0
  }}
}

Thanks,

Stephane

> On May 8, 2017, at 10:15 AM, Stephane Thiell <sthiell@xxxxxxxxxxxx> wrote:
> 
> Hi list,
> 
> I have a recurring issue using raid6 which results in a panic due to a BUG_ON(sh->batch_head). Last weekend, the issue occurred during a weekly raid-check. The raid volumes (12 total) are pretty new; no mismatches or hardware errors have been detected.
> 
> [535089.369357] kernel BUG at drivers/md/raid5.c:527!
> [535089.374700] invalid opcode: 0000 [#1] SMP
> [535089.379384] Modules linked in: ...
> [535089.503509] CPU: 34 PID: 280061 Comm: md0_resync Tainted: G OE ------------ 3.10.0-514.10.2.el7_lustre.x86_64 #1 > > This is the backtrace: > > crash> bt 280061 > PID: 280061 TASK: ffff8800757cde20 CPU: 34 COMMAND: "md0_resync" > #0 [ffff88024e217830] machine_kexec at ffffffff81059bdb > #1 [ffff88024e217890] __crash_kexec at ffffffff81105382 > #2 [ffff88024e217960] crash_kexec at ffffffff81105470 > #3 [ffff88024e217978] oops_end at ffffffff8168f508 > #4 [ffff88024e2179a0] die at ffffffff8102e93b > #5 [ffff88024e2179d0] do_trap at ffffffff8168ebc0 > #6 [ffff88024e217a20] do_invalid_op at ffffffff8102b144 > #7 [ffff88024e217ad0] invalid_op at ffffffff816984de > [exception RIP: raid5_get_active_stripe+1809] > RIP: ffffffffa0e4ed71 RSP: ffff88024e217b88 RFLAGS: 00010086 > RAX: 0000000000000000 RBX: ffff883fe5e40408 RCX: dead000000000200 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8823ebf45ca0 > RBP: ffff88024e217c30 R8: ffff8823ebf45cb0 R9: 0000000000000080 > R10: 0000000000000006 R11: 0000000000000000 R12: ffff883fe5e40400 > R13: ffff8823ebf45ca0 R14: 0000000000000000 R15: 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #8 [ffff88024e217b80] raid5_get_active_stripe at ffffffffa0e4e996 [raid456] > #9 [ffff88024e217be0] raid5_release_stripe at ffffffffa0e48f24 [raid456] > #10 [ffff88024e217c38] raid5_sync_request at ffffffffa0e53c4b [raid456] > #11 [ffff88024e217ca8] md_do_sync at ffffffff814fef9b > #12 [ffff88024e217e50] md_thread at ffffffff814fb1b5 > #13 [ffff88024e217ec8] kthread at ffffffff810b06ff > #14 [ffff88024e217f50] ret_from_fork at ffffffff81696b98 > > It appears to be triggered by BUG_ON(sh->batch_head) in init_stripe(): > > crash> l drivers/md/raid5.c:524 > 519 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous) > 520 { > 521 struct r5conf *conf = sh->raid_conf; > 522 int i, seq; > 523 > 524 BUG_ON(atomic_read(&sh->count) != 0); > 525 BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); > 526 BUG_ON(stripe_operations_active(sh)); > 527 BUG_ON(sh->batch_head); <<< > 528 > > Other I/Os were processed at this time, but I am not sure how to check that they were actually on the same md: > > crash> ps | grep ">" > <snip> >> 59684 2 28 ffff88407fdc0fb0 RU 0.0 0 0 [md0_raid6] >> 61479 2 17 ffff883e46e80000 UN 0.0 0 0 [ll_ost_io01_001] >> 220748 2 23 ffff881fb7ab8fb0 UN 0.0 0 0 [ll_ost_io01_011] >> 220750 2 19 ffff881fb7abce70 UN 0.0 0 0 [ll_ost_io01_013] >> 279158 2 14 ffff883ab46b4e70 RU 0.0 0 0 [md22_resync] >> 280061 2 34 ffff8800757cde20 RU 0.0 0 0 [md0_resync] >> 280829 2 6 ffff881c72296dd0 RU 0.0 0 0 [md6_resync] > > example of possible concurrent writing thread: > > crash> bt 61479 > PID: 61479 TASK: ffff883e46e80000 CPU: 17 COMMAND: "ll_ost_io01_001" > #0 [ffff883ffc805e58] crash_nmi_callback at ffffffff8104d2e2 > #1 [ffff883ffc805e68] nmi_handle at ffffffff8168f699 > #2 [ffff883ffc805eb0] do_nmi at ffffffff8168f813 > #3 [ffff883ffc805ef0] end_repeat_nmi at ffffffff8168ead3 > [exception RIP: _raw_spin_lock_irq+63] > RIP: ffffffff8168e09f RSP: ffff883e46e3f588 RFLAGS: 00000002 > RAX: 00000000000044c2 RBX: ffff883fe5e40408 RCX: 000000000000c464 > RDX: 000000000000c468 RSI: 000000000000c468 RDI: ffff883fe5e40408 > RBP: ffff883e46e3f588 R8: 0000000000000000 R9: 0000000000000080 > R10: 0000000000000002 R11: 0000000000000000 R12: ffff883fe5e40400 > R13: 0000000000000000 R14: ffff883fe0e61900 R15: 0000000000000000 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > --- <NMI exception stack> --- > #4 [ffff883e46e3f588] _raw_spin_lock_irq 
at ffffffff8168e09f > #5 [ffff883e46e3f590] raid5_get_active_stripe at ffffffffa0e4e6cb [raid456] > #6 [ffff883e46e3f648] raid5_make_request at ffffffffa0e4ef55 [raid456] > #7 [ffff883e46e3f738] md_make_request at ffffffff814f7dfc > #8 [ffff883e46e3f798] generic_make_request at ffffffff812ee939 > #9 [ffff883e46e3f7e0] submit_bio at ffffffff812eea81 > #10 [ffff883e46e3f838] osd_submit_bio at ffffffffa10a0bcc [osd_ldiskfs] > #11 [ffff883e46e3f848] osd_do_bio at ffffffffa10a3007 [osd_ldiskfs] > #12 [ffff883e46e3f968] osd_write_commit at ffffffffa10a3b94 [osd_ldiskfs] > #13 [ffff883e46e3fa08] ofd_commitrw_write at ffffffffa1113774 [ofd] > #14 [ffff883e46e3fa80] ofd_commitrw at ffffffffa1116f2d [ofd] > #15 [ffff883e46e3fb08] obd_commitrw at ffffffffa0c43c22 [ptlrpc] > #16 [ffff883e46e3fb70] tgt_brw_write at ffffffffa0c1bfc1 [ptlrpc] > #17 [ffff883e46e3fcd8] tgt_request_handle at ffffffffa0c18275 [ptlrpc] > #18 [ffff883e46e3fd20] ptlrpc_server_handle_request at ffffffffa0bc41fb [ptlrpc] > #19 [ffff883e46e3fde8] ptlrpc_main at ffffffffa0bc82b0 [ptlrpc] > #20 [ffff883e46e3fec8] kthread at ffffffff810b06ff > #21 [ffff883e46e3ff50] ret_from_fork at ffffffff81696b98 > > > The same issue happened during a heavy IOR benchmark a few months ago and was described in https://jira.hpdd.intel.com/browse/LU-8917 (on a slightly older el7 kernel). > > I also found out that other users have described similar issues, like on this thread: https://lkml.org/lkml/2016/12/23/205 > > Now, I have a crash dump and I’m trying to understand why sh->batch_head could be set in init_stripe(), which is called by raid5_get_active_stripe() when __find_stripe() failed BUT get_free_stripe() succeeded. If sh->batch_head is set in that case, that means the idle stripe found had it set… > > Does someone have any idea of how to troubleshoot or solve this? > > Thanks! > > Stephan ��.n��������+%������w��{.n�����{����w��ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f