Re: md raid6 oops in 6.6.4 stable

On 12/7/23 09:42, Guoqing Jiang wrote:
Hi,

On 12/7/23 21:55, Genes Lists wrote:
On 12/7/23 08:30, Bagas Sanjaya wrote:
On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
I have not had a chance to git bisect this, but since it happened in stable I
thought it was important to share sooner rather than later.

One possibly relevant commit between 6.6.3 and 6.6.4 could be:

   commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
   Author: Song Liu <song@xxxxxxxxxx>
   Date:   Fri Nov 17 15:56:30 2023 -0800

     md: fix bi_status reporting in md_end_clone_io

The attached log shows a page_fault_oops.
Machine was up for 3 days before crash happened.

Could you decode the oops (I can't find it in lore for some reason) ([1])? And
can it be reproduced reliably? If so, please share the steps to reproduce.

[1]. https://lwn.net/Articles/592724/

Thanks,
Guoqing

  - reproducing
An rsync runs twice a day, copying a (large) top-level directory from another machine to this server. On the third day after booting 6.6.4, the second of these rsync runs triggered the oops. I need to do more testing to see whether I can reproduce it reliably. I have not seen this oops on earlier stable kernels.

  - decoding the oops with scripts/decode_stacktrace.sh produced errors:
readelf: Error: Not an ELF file - it has the wrong magic bytes at the start

It appears that the decode script doesn't handle compressed modules. I changed the readelf line to decompress the module first. This fixes the script's complaint above, and the result is attached.
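
For anyone hitting the same readelf complaint: the error is consistent with readelf being handed a compressed module file instead of raw ELF. The Python sketch below is not the change made to decode_stacktrace.sh; it only illustrates telling the formats apart by their leading magic bytes. The name sniff_module_format is made up for this example, and the set of formats checked (ELF, zstd, xz, gzip) is an assumption about what a distribution might use for modules.

```python
import gzip
import lzma

# Leading "magic" bytes of the file formats of interest.
MAGICS = [
    (b"\x7fELF", "elf"),            # uncompressed ELF object
    (b"\x28\xb5\x2f\xfd", "zstd"),  # Zstandard frame
    (b"\xfd7zXZ\x00", "xz"),        # xz container
    (b"\x1f\x8b", "gzip"),          # gzip member
]


def sniff_module_format(header: bytes) -> str:
    """Return a short format name based on the first bytes of a file.

    Hypothetical helper for illustration only; it is not part of the
    kernel's scripts.
    """
    for magic, name in MAGICS:
        if header.startswith(magic):
            return name
    return "unknown"


if __name__ == "__main__":
    fake_elf = b"\x7fELF" + b"\x00" * 12  # stand-in for a real module
    print(sniff_module_format(fake_elf))
    print(sniff_module_format(gzip.compress(fake_elf)))
    print(sniff_module_format(lzma.compress(fake_elf)))
```

A tool that sees anything other than "elf" here would need to decompress the file before passing it to readelf, which is the kind of change described above.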

gene





Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: update_io_ticks+0x2c/0x60 
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
All code
========
   0:	1f                   	(bad)
   1:	00 0f                	add    %cl,(%rdi)
   3:	1f                   	(bad)
   4:	44 00 00             	add    %r8b,(%rax)
   7:	48 8b 4f 28          	mov    0x28(%rdi),%rcx
   b:	48 39 f1             	cmp    %rsi,%rcx
   e:	78 17                	js     0x27
  10:	80 7f 31 00          	cmpb   $0x0,0x31(%rdi)
  14:	74 3b                	je     0x51
  16:	48 8b 47 10          	mov    0x10(%rdi),%rax
  1a:	48 8b 78 40          	mov    0x40(%rax),%rdi
  1e:	48 8b 4f 28          	mov    0x28(%rdi),%rcx
  22:	48 39 f1             	cmp    %rsi,%rcx
  25:	79 e9                	jns    0x10
  27:	48 89 c8             	mov    %rcx,%rax
  2a:*	f0 48 0f b1 77 28    	lock cmpxchg %rsi,0x28(%rdi)		<-- trapping instruction
  30:	75 de                	jne    0x10
  32:	48 89 f0             	mov    %rsi,%rax
  35:	48 29 c8             	sub    %rcx,%rax
  38:	84 d2                	test   %dl,%dl
  3a:	b9                   	.byte 0xb9
  3b:	01 00                	add    %eax,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	f0 48 0f b1 77 28    	lock cmpxchg %rsi,0x28(%rdi)
   6:	75 de                	jne    0xffffffffffffffe6
   8:	48 89 f0             	mov    %rsi,%rax
   b:	48 29 c8             	sub    %rcx,%rax
   e:	84 d2                	test   %dl,%dl
  10:	b9                   	.byte 0xb9
  11:	01 00                	add    %eax,(%rax)
	...
Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS:  0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel:  <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70 
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0 
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180 
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30 
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60 
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160 
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 md_mod
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 raid456
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 raid456
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 raid456
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 raid456
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80 
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180 
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 md_mod
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 md_mod
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10 
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120 
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10 
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50 
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10 
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30 
Dec 06 19:20:54 s6 kernel:  </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel:  snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
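
As a reading aid for the dump above, here is a small Python sketch that decodes the page-fault error code and the low PTE flag bits using the architectural x86-64 bit layout, and checks that the faulting address matches the operand of the trapping instruction. The function names are hypothetical, and the sketch only restates what the log already reports; it does not identify a root cause.

```python
def decode_pf_error_code(code: int) -> dict:
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 0x1),  # 1: page was present
        "write": bool(code & 0x2),                 # 1: write access
        "user_mode": bool(code & 0x4),             # 1: fault came from user mode
        "reserved_bit": bool(code & 0x8),
        "instruction_fetch": bool(code & 0x10),
    }


def decode_pte_flags(pte: int) -> dict:
    """Decode a few flag bits of an x86-64 page-table entry."""
    return {
        "present": bool(pte & (1 << 0)),
        "writable": bool(pte & (1 << 1)),
        "user": bool(pte & (1 << 2)),
        "accessed": bool(pte & (1 << 5)),
        "dirty": bool(pte & (1 << 6)),
        "no_execute": bool(pte & (1 << 63)),
    }


if __name__ == "__main__":
    # Values copied from the oops above.
    error_code = 0x0003
    pte = 0x8000000101931021
    rdi = 0xFFFF8881019312C0
    cr2 = 0xFFFF8881019312E8

    print(decode_pf_error_code(error_code))
    print(decode_pte_flags(pte))
    # The trapping instruction is: lock cmpxchg %rsi,0x28(%rdi)
    print(hex(rdi + 0x28), hex(cr2), rdi + 0x28 == cr2)
```

Read this way, error code 0x0003 is a kernel-mode write to a present page, the PTE has its writable bit clear (consistent with "permissions violation"), and RDI + 0x28 equals CR2, so the fault is on the memory operand of the lock cmpxchg.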

