Re: Misbehavior of md-raid RAID on failed NVMe.

On Wed, 8 Jun 2022 10:48:09 +0700
Pavel <pavel2000@xxxxxx> wrote:

> Hi, linux-raid community.
> 
> Today we ran into strange, even alarming behavior of an md-raid array 
> based on NVMe devices.
> 
> We ordered a new server and started a data transfer (using dd; the 
> filesystems were unmounted on the source, etc. - no errors there).
> 
> While the data was in transfer, the kernel started reporting IO 
> errors on one of the NVMe devices. (dmesg output below)
> But md-raid did not react to them in any way. The RAID array never 
> went into a failed state, and "clean" was reported the whole time.
> 
> Based on past experience, we trusted md-raid and assumed things were 
> going fine. After the data transfer finished, the server was turned 
> off and the cables were replaced on suspicion.
> 
> After the OS started on the new server, we found MySQL crashing.
> A thorough checksum comparison showed mismatches in file contents.
> (Of course, we checksummed untouched files, not the MySQL database 
> files.)
> 
> So data loss is possible when an NVMe device misbehaves.
> We think md-raid has to remove a failed device from the array in such 
> a case. That this did not happen is wrong behaviour, so we want to 
> inform the community about this finding.
> 
> We hope this helps make the kernel even better.
> Thanks for your work.
> 

Hi Pavel,
IMO this is not a RAID problem. In this case some of the requests hang
inside the nvme driver, and raid1d hangs along with them (the trace
below shows raid1d in the middle of submitting IO), so the errors never
get back to the md layer. It is an nvme problem rather than a raid one.
RAID should handle it well as long as the IO errors are actually
completed and reported to it.
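
For anyone who hits this again: a minimal sketch of how to check
whether md ever saw the errors and how to fail the suspect member by
hand. The array and device names here (/dev/md0, /dev/nvme1n1p1,
/dev/nvme1) are placeholders - adjust for your setup:

  # Check array and per-member state; in your case the array kept
  # reporting "clean" because the errors never reached md.
  cat /proc/mdstat
  mdadm --detail /dev/md0

  # Check the device's own SMART/error counters (nvme-cli).
  nvme smart-log /dev/nvme1

  # If the device is known bad but md has not failed it, fail and
  # remove it manually.
  mdadm --manage /dev/md0 --fail /dev/nvme1n1p1
  mdadm --manage /dev/md0 --remove /dev/nvme1n1p1

Note this only works when the stuck commands eventually complete;
requests that hang inside the nvme driver forever never produce an
error for md to act on.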

Thanks,
Mariusz

> ---
> [Tue Jun  7 09:58:45 2022] Call Trace:
> [Tue Jun  7 09:58:45 2022]  <IRQ>
> [Tue Jun  7 09:58:45 2022]  nvme_pci_complete_rq+0x5b/0x67 [nvme]
> [Tue Jun  7 09:58:45 2022]  nvme_poll_cq+0x1e4/0x265 [nvme]
> [Tue Jun  7 09:58:45 2022]  nvme_irq+0x36/0x6e [nvme]
> [Tue Jun  7 09:58:45 2022]  __handle_irq_event_percpu+0x73/0x13e
> [Tue Jun  7 09:58:45 2022]  handle_irq_event_percpu+0x31/0x77
> [Tue Jun  7 09:58:45 2022]  handle_irq_event+0x2e/0x51
> [Tue Jun  7 09:58:45 2022]  handle_edge_irq+0xc9/0xee
> [Tue Jun  7 09:58:45 2022]  __common_interrupt+0x41/0x9e
> [Tue Jun  7 09:58:45 2022]  common_interrupt+0x6e/0x8b
> [Tue Jun  7 09:58:45 2022]  </IRQ>
> [Tue Jun  7 09:58:45 2022]  <TASK>
> [Tue Jun  7 09:58:45 2022]  asm_common_interrupt+0x1e/0x40
> [Tue Jun  7 09:58:45 2022] RIP: 0010:__blk_mq_try_issue_directly+0x12/0x136
> [Tue Jun  7 09:58:45 2022] Code: fe ff ff 48 8b 97 78 01 00 00 48 8b 92 
> 80 00 00 00 48 89 34 c2 b0 01 c3 0f 1f 44 00 00 41 57 41 56 41 55 41 54 
> 55 48 89 f5 53 <89> d3 48 83 ec 18 65 48 8b 04 25 28 00 00 00 48 89 44 
> 24 10 31 c0
> [Tue Jun  7 09:58:45 2022] RSP: 0018:ffff88810bbf7ad8 EFLAGS: 00000246
> [Tue Jun  7 09:58:45 2022] RAX: 0000000000000000 RBX: 0000000000000001 
> RCX: 0000000000000001
> [Tue Jun  7 09:58:45 2022] RDX: 0000000000000001 RSI: ffff8881137e6e40 
> RDI: ffff88810dfdf400
> [Tue Jun  7 09:58:45 2022] RBP: ffff8881137e6e40 R08: 0000000000000001 
> R09: 0000000000000a20
> [Tue Jun  7 09:58:45 2022] R10: 0000000000000000 R11: 0000000000000000 
> R12: 0000000000000000
> [Tue Jun  7 09:58:45 2022] R13: ffff88810dfdf400 R14: ffff8881137e6e40 
> R15: ffff88810bbf7df0
> [Tue Jun  7 09:58:45 2022]  blk_mq_request_issue_directly+0x46/0x78
> [Tue Jun  7 09:58:45 2022]  blk_mq_try_issue_list_directly+0x41/0xba
> [Tue Jun  7 09:58:45 2022]  blk_mq_sched_insert_requests+0x86/0xd0
> [Tue Jun  7 09:58:45 2022]  blk_mq_flush_plug_list+0x1b5/0x214
> [Tue Jun  7 09:58:45 2022]  ? __blk_mq_alloc_requests+0x1c7/0x21d
> [Tue Jun  7 09:58:45 2022]  blk_mq_submit_bio+0x437/0x518
> [Tue Jun  7 09:58:45 2022]  submit_bio_noacct+0x93/0x1e6
> [Tue Jun  7 09:58:45 2022]  ? bio_associate_blkg_from_css+0x137/0x15c
> [Tue Jun  7 09:58:45 2022]  flush_bio_list+0x96/0xa5
> [Tue Jun  7 09:58:45 2022]  flush_pending_writes+0x7a/0xbf
> [Tue Jun  7 09:58:45 2022]  ? md_check_recovery+0x8a/0x4bd
> [Tue Jun  7 09:58:45 2022]  raid1d+0x194/0x10e8
> [Tue Jun  7 09:58:45 2022]  ? common_interrupt+0xf/0x8b
> [Tue Jun  7 09:58:45 2022]  md_thread+0x12c/0x155
> [Tue Jun  7 09:58:45 2022]  ? init_wait_entry+0x29/0x29
> [Tue Jun  7 09:58:45 2022]  ? signal_pending+0x19/0x19
> [Tue Jun  7 09:58:45 2022]  kthread+0x104/0x10c
> [Tue Jun  7 09:58:45 2022]  ? set_kthread_struct+0x32/0x32
> [Tue Jun  7 09:58:45 2022]  ret_from_fork+0x22/0x30
> [Tue Jun  7 09:58:45 2022]  </TASK>
> [Tue Jun  7 09:58:45 2022] ---[ end trace 15dc74ae2e04f737 ]---
> [Tue Jun  7 09:58:45 2022] ------------[ cut here ]------------
> [Tue Jun  7 09:58:45 2022] refcount_t: underflow; use-after-free.
