On Wed, May 1, 2019 at 1:43 PM Marc Smith <msmith626@xxxxxxxxx> wrote:
>
> Hi,
>
> I'm using some MD RAID5 arrays with Linux 4.14.91. Everything has been
> working great for some time now, but this morning I noticed the
> following snippet of kernel messages:
> --snip--
> Apr 30 23:49:09 node1 kernel: [10496.092367] stripe state: 2001
> Apr 30 23:49:09 node1 kernel: [10496.092395] ------------[ cut here ]------------
> Apr 30 23:49:09 node1 kernel: [10496.092408] WARNING: CPU: 13 PID: 3786 at drivers/md/raid5.c:4611 break_stripe_batch_list+0x86/0x1fb
> Apr 30 23:49:09 node1 kernel: [10496.092410] Modules linked in: scst_qla2xxx(O) fcst(O) scst_changer(O) scst_tape(O) scst_vdisk(O) scst_disk(O) ib_srpt(O) isert_scst(O) iscsi_scst(O) scst(O) qla2xxx(O) bonding ntb_netdev ntb_hw_switchtec(O) cls(O) mlx5_core bna ib_umad rdma_ucm ib_uverbs ib_srp iw_nes iw_cxgb4 cxgb4 iw_cxgb3 ib_qib rdmavt mlx4_ib ib_mthca
> Apr 30 23:49:09 node1 kernel: [10496.092450] CPU: 13 PID: 3786 Comm: md125_raid5 Tainted: G O 4.14.91-esos.prod #1
> Apr 30 23:49:09 node1 kernel: [10496.092452] Hardware name: CELESTICA-CSS Athena/Athena-MB, BIOS COL00708 11/26/2018
> Apr 30 23:49:09 node1 kernel: [10496.092455] task: ffff888f84183b40 task.stack: ffffc9000b2ec000
> Apr 30 23:49:09 node1 kernel: [10496.092459] RIP: 0010:break_stripe_batch_list+0x86/0x1fb
> Apr 30 23:49:09 node1 kernel: [10496.092462] RSP: 0018:ffffc9000b2efc40 EFLAGS: 00010286
> Apr 30 23:49:09 node1 kernel: [10496.092465] RAX: 0000000000000012 RBX: ffff888f182aaad0 RCX: 0000000000000000
> Apr 30 23:49:09 node1 kernel: [10496.092467] RDX: ffff88903fb5d001 RSI: ffff88903fb554c8 RDI: ffff88903fb554c8
> Apr 30 23:49:09 node1 kernel: [10496.092469] RBP: ffff888f25222240 R08: 0000000000000001 R09: 0000000000020300
> Apr 30 23:49:09 node1 kernel: [10496.092471] R10: 0000000000000000 R11: 00000000000fe6b4 R12: 0000000000000000
> Apr 30 23:49:09 node1 kernel: [10496.092473] R13: ffff888f4b1e3360 R14: 0000000000001c04 R15: ffff888efcffab18
> Apr 30 23:49:09 node1 kernel: [10496.092476] FS: 0000000000000000(0000) GS:ffff88903fb40000(0000) knlGS:0000000000000000
> Apr 30 23:49:09 node1 kernel: [10496.092478] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 30 23:49:09 node1 kernel: [10496.092480] CR2: 00007f834dbce698 CR3: 0000000002812005 CR4: 00000000007606e0
> Apr 30 23:49:09 node1 kernel: [10496.092483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 30 23:49:09 node1 kernel: [10496.092485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Apr 30 23:49:09 node1 kernel: [10496.092486] PKRU: 55555554
> Apr 30 23:49:09 node1 kernel: [10496.092487] Call Trace:
> Apr 30 23:49:09 node1 kernel: [10496.092498] handle_stripe+0xcdf/0x1958
> Apr 30 23:49:09 node1 kernel: [10496.092507] ? enqueue_task_fair+0x219/0x96b
> Apr 30 23:49:09 node1 kernel: [10496.092513] handle_active_stripes.isra.26+0x329/0x396
> Apr 30 23:49:09 node1 kernel: [10496.092518] raid5d+0x302/0x47f
> Apr 30 23:49:09 node1 kernel: [10496.092522] ? del_timer_sync+0x22/0x2c
> Apr 30 23:49:09 node1 kernel: [10496.092530] ? md_register_thread+0xc1/0xc1
> Apr 30 23:49:09 node1 kernel: [10496.092534] ? md_thread+0x12b/0x13d
> Apr 30 23:49:09 node1 kernel: [10496.092537] md_thread+0x12b/0x13d
> Apr 30 23:49:09 node1 kernel: [10496.092544] ? wait_woken+0x68/0x68
> Apr 30 23:49:09 node1 kernel: [10496.092552] kthread+0x117/0x11f
> Apr 30 23:49:09 node1 kernel: [10496.092557] ? kthread_create_on_node+0x3a/0x3a
> Apr 30 23:49:09 node1 kernel: [10496.092564] ret_from_fork+0x35/0x40
> Apr 30 23:49:09 node1 kernel: [10496.092568] Code: 48 89 83 90 00 00 00 f7 c6 a9 c2 eb 00 74 1e 80 3d 12 74 f6 00 00 75 15 48 c7 c7 bf c8 56 82 c6 05 02 74 f6 00 01 e8 4b 6f 6b ff <0f> 0b 48 8b 75 48 f7 c6 20 00 08 00 74 1e 80 3d e7 73 f6 00 00
> Apr 30 23:49:09 node1 kernel: [10496.092629] ---[ end trace 90e17afe3799d471 ]---
> --snip--
>
> I see that comes from break_stripe_batch_list() in
> linux-4.14.91/drivers/md/raid5.c:
> --snip--
> WARN_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
>                        (1 << STRIPE_SYNCING) |
>                        (1 << STRIPE_REPLACED) |
>                        (1 << STRIPE_DELAYED) |
>                        (1 << STRIPE_BIT_DELAY) |
>                        (1 << STRIPE_FULL_WRITE) |
>                        (1 << STRIPE_BIOFILL_RUN) |
>                        (1 << STRIPE_COMPUTE_RUN) |
>                        (1 << STRIPE_OPS_REQ_PENDING) |
>                        (1 << STRIPE_DISCARD) |
>                        (1 << STRIPE_BATCH_READY) |
>                        (1 << STRIPE_BATCH_ERR) |
>                        (1 << STRIPE_BITMAP_PENDING)),
>           "stripe state: %lx\n", sh->state);
> --snip--
>
> I see the "stripe state: 2001" value in the log. I can go through and
> decode, but I'm still probably not going to be sure what's expected or
> wrong. The MD array seems to be functioning correctly, I'm not seeing
> any more errors but I do understand the statement above is WARN_ONCE().

So for 0x2001, the only bit from this mask that is set is STRIPE_ACTIVE.

>
> Is this a sign of corruption / serious issue, or transient problem?
> Any additional debug steps that I can perform to collect more data? I
> searched a bit on Google for this error, but didn't get any relevant
> hits. Any help would be greatly appreciated.

This one looks like a race condition between head_sh and sh in the batch
list, so it doesn't seem too bad. Could you try rebooting the system and
see whether this happens again?

Thanks,
Song

>
> Thanks,
>
> Marc