On Tue, Aug 15, 2017 at 05:59:38PM +0000, Stephane Thiell wrote: > > > On Aug 15, 2017, at 2:24 AM, Guoqing Jiang <gqjiang@xxxxxxxx> wrote: > > > > > > > > On 08/14/2017 01:29 AM, Stephane Thiell wrote: > >> Also known as "kernel BUG at drivers/md/raid5.c:527!" > >> > >> I’m still struggling with this issue in raid5.c. I currently have one crash per week in average on a system using Lustre and 24 x 10-disk raid6 per server. It did happen with or without check running. I now think there is a race at some point where a released stripe head (batched) is added to conf->inactive_list without sh->batch_head being set back to NULL. > >> > >> Because sh->batch_head is only being set back to NULL in a single function, break_stripe_batch_list(), my guess is that a call to raid5_release_stripe() on the head stripe is performed elsewhere. I didn’t find where this could come from yet. So perhaps sh->batch_head should be just set to NULL in do_release_stripe(), before being added to the inactive list? > >> > >> Alternatively, I wonder if it would make more sense to set batch_head to NULL where sh->overwrite_disks is also set back to 0 (like in ops_run_biodrain() today). > >> > >> Because sh->overwrite_disks is also set to 0 in init_stripe(), my current plan is to try this change, at least as a workaround: remove the BUG_ON(sh->batch_head) from init_stripe() and set sh->batch_head = NULL (I don’t think I need to lock on sh->stripe_lock there?). I will get new hardware soon so I will give it a try on a test system. Any suggestions welcomed… > >> > >> Quick analysis of the last vmcore: > >> > >> crash> ps | grep ">" > >> ... > >>> 16119 16111 22 ffff881f93bace70 RU 0.0 150772 10748 e2label > >>> 151448 2 34 ffff881ff753bec0 RU 0.0 0 0 [kiblnd_sd_00_00] > >>> 151449 2 2 ffff881ff753edd0 RU 0.0 0 0 [kiblnd_sd_00_01] > >>> 154637 2 29 ffff881f776b1f60 RU 0.0 0 0 [md32_raid6] > >>> 163643 2 11 ffff8820d200bec0 UN 0.0 0 0 [ll_ost_io01_008] > >>> 164083 2 13 ffff8820d2753ec0 UN 0.0 0 0 [ll_ost_io01_034] > >>> 171389 2 5 ffff883fd072de20 UN 0.0 0 0 [ll_ost_io01_058] > >>> 171391 2 17 ffff883fd072ce70 UN 0.0 0 0 [ll_ost_io01_060] > >>> 171399 2 33 ffff883fd9e9ce70 UN 0.0 0 0 [ll_ost_io01_068] > >>> 237019 2 15 ffff882715628fb0 UN 0.0 0 0 [ll_ost_io01_084] > >>> 237033 2 3 ffff881fb0250fb0 UN 0.0 0 0 [ll_ost_io01_098] > >> ... > >> > >> crash> bt 237033 > >> PID: 237033 TASK: ffff881fb0250fb0 CPU: 3 COMMAND: "ll_ost_io01_098" > >> #0 [ffff8826fa3e7240] machine_kexec at ffffffff81059bdb > >> #1 [ffff8826fa3e72a0] __crash_kexec at ffffffff811057c2 > >> #2 [ffff8826fa3e7370] crash_kexec at ffffffff811058b0 > >> #3 [ffff8826fa3e7388] oops_end at ffffffff8168fac8 > >> #4 [ffff8826fa3e73b0] die at ffffffff8102e93b > >> #5 [ffff8826fa3e73e0] do_trap at ffffffff8168f180 > >> #6 [ffff8826fa3e7430] do_invalid_op at ffffffff8102b144 > >> #7 [ffff8826fa3e74e0] invalid_op at ffffffff81698b5e > >> [exception RIP: raid5_get_active_stripe+1809] > >> RIP: ffffffffa0fefd71 RSP: ffff8826fa3e7590 RFLAGS: 00010086 > >> RAX: 0000000000000000 RBX: ffff883df051641c RCX: dead000000000200 > >> RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff881e08d54d30 > >> RBP: ffff8826fa3e7638 R8: ffff881e08d54d40 R9: 0000000000000080 > >> R10: 0000000000000000 R11: 0000000000000000 R12: ffff883df0516400 > >> R13: ffff881e08d54d30 R14: 0000000000000000 R15: 0000000000000000 > >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > >> #8 [ffff8826fa3e7640] raid5_make_request at ffffffffa0feff55 [raid456] > >> #9 [ffff8826fa3e7730] md_make_request at ffffffff814f838c > >> #10 [ffff8826fa3e7790] generic_make_request at ffffffff812eedc9 > >> #11 [ffff8826fa3e77d8] submit_bio at ffffffff812eef11 > >> #12 [ffff8826fa3e7830] osd_submit_bio at ffffffffa10cdbcc [osd_ldiskfs] > >> #13 [ffff8826fa3e7840] osd_do_bio at ffffffffa10d03c1 [osd_ldiskfs] > >> #14 [ffff8826fa3e7960] osd_write_commit at ffffffffa10d0b94 [osd_ldiskfs] > >> #15 [ffff8826fa3e7a00] ofd_commitrw_write at ffffffffa120c734 [ofd] > >> #16 [ffff8826fa3e7a78] ofd_commitrw at ffffffffa120ff07 [ofd] > >> #17 [ffff8826fa3e7b08] obd_commitrw at ffffffffa0c7dc22 [ptlrpc] > >> #18 [ffff8826fa3e7b70] tgt_brw_write at ffffffffa0c55fc1 [ptlrpc] > >> #19 [ffff8826fa3e7cd8] tgt_request_handle at ffffffffa0c52275 [ptlrpc] > >> #20 [ffff8826fa3e7d20] ptlrpc_server_handle_request at ffffffffa0bfe1fb [ptlrpc] > >> #21 [ffff8826fa3e7de8] ptlrpc_main at ffffffffa0c022b0 [ptlrpc] > >> #22 [ffff8826fa3e7ec8] kthread at ffffffff810b0a4f > >> #23 [ffff8826fa3e7f50] ret_from_fork at ffffffff81697218 > >> > >> > >> After some disassembly work, I found out that all active tasks are working on the same md (all are raid6): > >> > >> md32’s (struct r5conf *) is 0xffff883df0516400 > >> > >> Our crashed task: > >> > >> crash> printf "%s", ((struct r5conf)*0xffff883df0516400).cache_name[0] > >> raid6-md32 > >> > >> Example for md32_raid6 task in release_inactive_stripe_list: > >> > >> crash> bt 154637 > >> PID: 154637 TASK: ffff881f776b1f60 CPU: 29 COMMAND: "md32_raid6" > >> #0 [ffff883ffc985e48] crash_nmi_callback at ffffffff8104d2e2 > >> #1 [ffff883ffc985e58] nmi_handle at ffffffff8168fc77 > >> #2 [ffff883ffc985eb0] do_nmi at ffffffff8168fe83 > >> #3 [ffff883ffc985ef0] end_repeat_nmi at ffffffff8168f093 > >> [exception RIP: _raw_spin_lock_irqsave+64] > >> RIP: ffffffff8168e5f0 RSP: ffff881e8d0cbc60 RFLAGS: 00000006 > >> RAX: 00000000000031bb RBX: ffff883df0516740 RCX: 000000000000009a > >> RDX: 0000000000000098 RSI: 000000000000009a RDI: ffff883df051641c > >> RBP: ffff881e8d0cbc60 R8: 0000000000000206 R9: 0000000000000000 > >> R10: 0000000000000000 R11: 0000000000000007 R12: ffff883df0516400 > >> R13: ffff883df051641c R14: 0000000000000006 R15: 0000000000000005 > >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > >> --- <NMI exception stack> --- > >> #4 [ffff881e8d0cbc60] _raw_spin_lock_irqsave at ffffffff8168e5f0 > >> #5 [ffff881e8d0cbc68] release_inactive_stripe_list at ffffffffa0fe5500 [raid456] > >> #6 [ffff881e8d0cbcb0] handle_active_stripes at ffffffffa0ff34c2 [raid456] > >> #7 [ffff881e8d0cbd60] raid5d at ffffffffa0ff3c28 [raid456] > >> #8 [ffff881e8d0cbe50] md_thread at ffffffff814fb745 > >> #9 [ffff881e8d0cbec8] kthread at ffffffff810b0a4f > >> #10 [ffff881e8d0cbf50] ret_from_fork at ffffffff81697218 > >> > >> > >> crash> bt 163643 164083 171389 171391 171399 237019 237033 |grep "exception RIP:" > >> [exception RIP: _raw_spin_lock_irq+63] > >> [exception RIP: _raw_spin_lock_irq+63] > >> [exception RIP: _raw_spin_lock_irq+58] > >> [exception RIP: _raw_spin_lock_irq+63] > >> [exception RIP: _raw_spin_lock_irq+58] > >> [exception RIP: _raw_spin_lock_irq+63] > >> [exception RIP: raid5_get_active_stripe+1809] > >> > >> crash> bt 163643 164083 171389 171391 171399 237019 237033 |grep "raid5_get_active_stripe" > >> #5 [ffff883bcee93588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456] > >> #5 [ffff883ba4af7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456] > >> #5 [ffff88390877b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456] > >> #5 [ffff88390803b588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456] > >> #5 [ffff88390831f588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456] > >> #5 [ffff8826fa7f7588] raid5_get_active_stripe at ffffffffa0fef6cb [raid456] > >> [exception RIP: raid5_get_active_stripe+1809] > >> > >> crash> bt 163643 164083 171389 171391 171399 237019 237033 |grep R12 > >> R10: 0000000000000003 R11: 0000000000000000 R12: ffff883df0516400 > >> R10: 0000000000000001 R11: 0000000000000000 R12: ffff883df0516400 > >> R10: 0000000000000005 R11: 0000000000000000 R12: ffff883df0516400 > >> R10: 0000000000000007 R11: 0000000000000000 R12: ffff883df0516400 > >> R10: 0000000000000009 R11: 0000000000000000 R12: ffff883df0516400 > >> R10: 0000000000000009 R11: 0000000000000000 R12: ffff883df0516400 > >> R10: 0000000000000000 R11: 0000000000000000 R12: ffff883df0516400 > >> > >> > >> The bad stripe_head (sh) is available at (usually in %r13): > >> > >> crash> p ((struct stripe_head)*0xffff881e08d54d30).raid_conf > >> $98 = (struct r5conf *) 0xffff883df0516400 > >> > >> and batch_head is not NULL... that's the issue: > >> > >> crash> p ((struct stripe_head)*0xffff881e08d54d30).batch_head > >> $99 = (struct stripe_head *) 0xffff881e08d54d30 > >> > >> Full struct dump below. We can see than overwrite_disks is set to 0, state is 0… but this batch_head not set to NULL makes it crash in init_stripe(): > >> > >> crash> p ((struct stripe_head)*0xffff881e08d54d30) > >> $6 = { > >> hash = { > >> next = 0x0, > >> pprev = 0x0 > >> }, > >> lru = { > >> next = 0xffff881e08d54d40, > >> prev = 0xffff881e08d54d40 > >> }, > >> release_list = { > >> next = 0xffff881dfe405cc0 > >> }, > >> raid_conf = 0xffff883df0516400, > >> generation = 0, > >> sector = 3701224808, > >> pd_idx = 1, > >> qd_idx = 2, > >> ddf_layout = 0, > >> hash_lock_index = 5, > >> state = 0, > >> count = { > >> counter = 0 > >> }, > >> bm_seq = 22708787, > >> disks = 10, > >> overwrite_disks = 0, > >> check_state = check_state_idle, > >> reconstruct_state = reconstruct_state_idle, > >> stripe_lock = { > >> { > >> rlock = { > >> raw_lock = { > >> { > >> head_tail = 3709263126, > >> tickets = { > >> head = 56598, > >> tail = 56598 > >> } > >> } > >> } > >> } > >> } > >> }, > >> cpu = 15, > >> group = 0x0, > >> batch_head = 0xffff881e08d54d30, > >> batch_lock = { > >> { > >> rlock = { > >> raw_lock = { > >> { > >> head_tail = 9961624, > >> tickets = { > >> head = 152, > >> tail = 152 > >> } > >> } > >> } > >> } > >> } > >> }, > >> batch_list = { > >> next = 0xffff881dfe405d28, > >> prev = 0xffff881e0e2dec98 > >> }, > >> log_io = 0x0, > >> log_list = { > >> next = 0x0, > >> prev = 0x0 > >> }, > >> ops = { > >> target = 0, > >> target2 = 0, > >> zero_sum_result = (unknown: 0) > >> }, > >> dev = {{ > >> req = { > >> bi_sector = 0, > >> bi_next = 0x0, > >> bi_bdev = 0x0, > >> bi_flags = 1, > >> bi_rw = 0, > >> bi_vcnt = 0, > >> bi_idx = 0, > >> bi_phys_segments = 0, > >> bi_size = 0, > >> bi_seg_front_size = 0, > >> bi_seg_back_size = 0, > >> bi_end_io = 0x0, > >> bi_private = 0xffff881e08d54d30, > >> bi_ioc = 0x0, > >> bi_css = 0x0, > >> bi_integrity = 0x0, > >> bi_max_vecs = 1, > >> bi_cnt = { > >> counter = 1 > >> }, > >> bi_io_vec = 0xffff881e08d54f00, > >> bi_pool = 0x0, > >> { > >> bio_aux = 0x0, > >> __UNIQUE_ID_rh_kabi_hide15 = { > >> rh_reserved1 = 0x0 > >> }, > >> {<No data fields>} > >> }, > >> bi_inline_vecs = 0xffff881e08d54e78 > >> }, > >> rreq = { > >> bi_sector = 0, > >> bi_next = 0x0, > >> bi_bdev = 0x0, > >> bi_flags = 1, > >> bi_rw = 0, > >> bi_vcnt = 0, > >> bi_idx = 0, > >> bi_phys_segments = 0, > >> bi_size = 0, > >> bi_seg_front_size = 0, > >> bi_seg_back_size = 0, > >> bi_end_io = 0x0, > >> bi_private = 0xffff881e08d54d30, > >> bi_ioc = 0x0, > >> bi_css = 0x0, > >> bi_integrity = 0x0, > >> bi_max_vecs = 1, > >> bi_cnt = { > >> counter = 1 > >> }, > >> bi_io_vec = 0xffff881e08d54f10, > >> bi_pool = 0x0, > >> { > >> bio_aux = 0x0, > >> __UNIQUE_ID_rh_kabi_hide15 = { > >> rh_reserved1 = 0x0 > >> }, > >> {<No data fields>} > >> }, > >> bi_inline_vecs = 0xffff881e08d54f00 > >> }, > >> vec = { > >> bv_page = 0xffffea0078236080, > >> bv_len = 4096, > >> bv_offset = 0 > >> }, > >> rvec = { > >> bv_page = 0x0, > >> bv_len = 0, > >> bv_offset = 0 > >> }, > >> page = 0xffffea0078236080, > >> orig_page = 0xffffea0078236080, > >> toread = 0x0, > >> read = 0x0, > >> towrite = 0x0, > >> written = 0xffff8824832a7c00, > >> sector = 29609798632, > >> flags = 24, > > > > Seems the flags is R5_OVERWRITE | R5_Insync, handle_stripe_clean_event > > only calls break_stripe_batch_list if do_endio is set to true. > > > > if (head_sh->batch_head && do_endio) > > break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS); > > > > And do_endio is set to true if we meet the below condition. > > > > if (!test_bit(R5_LOCKED, &dev->flags) && > > (test_bit(R5_UPTODATE, &dev->flags) || > > test_bit(R5_Discard, &dev->flags) || > > test_bit(R5_SkipCopy, &dev->flags))) { > > > > I guess it is why the batch_head is not null, just my 2 cents. > > > That’s a very good point, thanks Guoqing! > > I checked another similar crash dump, and found out that this same flags was 16 (R5_Insync alone), which doesn’t meet the condition for having do_endio = true neither. > > The code below in raid5_end_write_request does directly release a sh->batch_head, couldn’t that be an issue? > > static void raid5_end_write_request(struct bio *bi) > { > > ... > > raid5_release_stripe(sh); > > if (sh->batch_head && sh != sh->batch_head) > raid5_release_stripe(sh->batch_head); > } Can you dump the full stripe_head info for this crash dump? Also stripe_head should include 10 dev for your 10-disk raid, but crash dump only show one, is it possible you can dump the info for the devs in the stripe? -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html