Hi All,

I am trying to track down what may be causing a very rare and unpredictable kernel oops on MediaTek MT7621 SoC based platforms. I mentioned this initially in this thread:

https://lore.kernel.org/all/3c4a0ab9-bc54-4584-bb27-d6045096335b@xxxxxxxxxx/T/

but that was more about the memory debug warning than the underlying issue I am trying to track down.

Essentially what I am seeing is a variety of unpredictable oopses, and sometimes warnings, mostly within the memory subsystem. Here is a typical one:

kernel: Unhandled kernel unaligned access[#1]:
kernel: CPU: 0 PID: 12943 Comm: sh Not tainted 5.15.0 #1
kernel: $ 0 : 00000000 00000001 00000011 81c10000
kernel: $ 4 : 80402e80 00000cc0 8117c39c 81c22f48
kernel: $ 8 : 02e808a6 81b043c8 00000000 81be1000
kernel: $12 : 81be1000 000083e7 00000000 81be1000
kernel: $16 : 81b00000 00000000 81b043c8 80402e80
kernel: $20 : 81c10000 00000001 8117c39c 00000cc0
kernel: $24 : 00000000 77da5ce8
kernel: $28 : 835be000 835bfcd0 00000081 8119bcb8
kernel: Hi : 0000000f
kernel: Lo : 0000003c
kernel: epc : 8119bd54 kmem_cache_alloc+0xe4/0x5b8
kernel: ra : 8119bcb8 kmem_cache_alloc+0x48/0x5b8
kernel: Status: 11000403^IKERNEL EXL IE
kernel: Cause : 40800010 (ExcCode 04)
kernel: BadVA : 00000011
kernel: PrId : 0001992f (MIPS 1004Kc)
kernel: Modules linked in: xt_statistic xt_realm xt_nat nf_conntrack_netlink arptable_filter arp_tables ip6table_mangle ip6table_raw ip6table_nat ip6t_ah ip6table_filter ip6_tables xt_TCPMSS xt_mark iptable_mangle xt_CT iptable_raw xt_connmark iptable_nat xt_set xt_tcpudp xt_conntrack xt_LOG nf_log_syslog xt_limit xt_addrtype ip_set_hash_netiface ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables crypto_hw_eip93 ath10k_pci ath10k_core ath
kernel: Process sh (pid: 12943, threadinfo=24898ed1, task=5506fe17, tls=77e28e64)
kernel: Stack : 00000000 81b043c8 00000001 00000000 81bd38b0 00000000 82be00c0 81b00000
kernel:         82be00c0 805eb600 81c10000 00000001 82be00c0 8117c39c 00000025 00000001
kernel:         00000081 00000000 81b00000 00000078 00000000 00000081 81b00000 82be00c0
kernel:         81b70000 00000000 00000001 8116f190 82be03c0 5e0db60a 00000cc0 00000000
kernel:         00000000 81b00000 81b00000 556a1074 82be00c0 00000cc0 0007ffff 7ffff000
kernel:         ...
kernel: Call Trace:
kernel: [<8119bd54>] kmem_cache_alloc+0xe4/0x5b8
kernel: [<8117c39c>] __anon_vma_prepare+0x3c/0x1a4
kernel: [<8116f190>] handle_mm_fault+0x4e8/0xea4
kernel: [<81167638>] __get_user_pages.part.94+0x154/0x338
kernel: [<81167e64>] __get_user_pages_remote+0x118/0x3ac
kernel: [<811ba314>] get_arg_page+0x5c/0x108
kernel: [<811bacec>] copy_string_kernel+0x104/0x248
kernel: [<811bbce0>] do_execveat_common+0x148/0x1ec
kernel: [<811bcc38>] sys_execve+0x34/0x48
kernel: [<81015570>] syscall_common+0x34/0x58
kernel:
kernel: Code: 02c03025 8e62001c 02a21021 <8c460000> 41656000 30a50001 000000c0 8f84000c 00003825
kernel:

I see failures like this ending in kmem_cache_alloc() fairly often, but not always with the same BadVA or the same call path.

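As a cross-check on the dump above, the faulting word marked in the Code: line can be decoded by hand. The following is only a minimal sketch, assuming the standard MIPS32 I-type instruction layout and the usual position of the ExcCode field in the Cause register. The function names are made up for illustration and do not come from any kernel tool.

```python
# Sketch only: decode the faulting instruction word and the Cause register
# from the first dump. Assumes the standard MIPS32 I-type layout
# (opcode[31:26] rs[25:21] rt[20:16] imm[15:0]) and ExcCode in Cause[6:2].
# decode_itype() and exc_code() are hypothetical helper names.

LW_OPCODE = 0x23  # MIPS32 "lw" (load word)


def decode_itype(word):
    """Split a 32-bit I-type instruction into (opcode, rs, rt, signed imm)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:  # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs, rt, imm


def exc_code(cause):
    """Extract the ExcCode field (bits 6..2) from a Cause register value."""
    return (cause >> 2) & 0x1F


# Values copied from the first dump in this mail.
regs = {2: 0x00000011}   # $2 from the register dump
faulting = 0x8C460000    # the word shown as <8c460000> in "Code:"
cause = 0x40800010
bad_va = 0x00000011

opcode, rs, rt, imm = decode_itype(faulting)
effective = (regs[rs] + imm) & 0xFFFFFFFF

if opcode == LW_OPCODE:
    print(f"lw ${rt}, {imm}(${rs})")
print(f"effective address: {effective:08x}, BadVA: {bad_va:08x}")
print(f"ExcCode: {exc_code(cause)}")
```

Under those assumptions the faulting word is a load through $2, and $2 holds exactly the BadVA value (00000011), which is not word aligned. That matches ExcCode 04 (address error on load). The second dump below has the same form of load (8c470000) through $2 = c200002c, this time with ExcCode 02 (TLB miss on load). This only shows which pointer was bad at the time of the fault, not what corrupted it.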
Here is another:

kernel: CPU 0 Unable to handle kernel paging request at virtual address c200002c, epc == 81198c4c, ra == 81198b94
kernel: Oops[#1]:
kernel: CPU: 0 PID: 18368 Comm: accns_status_mo Not tainted 5.14.0 #1
kernel: $ 0 : 00000000 00000001 c200002c 00049000
kernel: $ 4 : 80412200 00000cc0 2f014d31 81c06f98
kernel: $ 8 : 00100173 81ae83cc 00000000 81bc1000
kernel: $12 : 81bc1000 00008d34 00000000 81bc1000
kernel: $16 : 81af0000 00000000 81ae83cc 80412200
kernel: $20 : 81bf0000 c2000000 8102a1c8 00000cc0
kernel: $24 : 00000000 8116e93c
kernel: $28 : 86810000 86811c38 82ff3c8d 81198b94
kernel: Hi : 0089543b
kernel: Lo : b55e0000
kernel: epc : 81198c4c kmem_cache_alloc+0x100/0x5d4
kernel: ra : 81198b94 kmem_cache_alloc+0x48/0x5d4
kernel: Status: 11000403^IKERNEL EXL IE
kernel: Cause : 40800008 (ExcCode 02)
kernel: BadVA : c200002c
kernel: PrId : 0001992f (MIPS 1004Kc)
kernel: Modules linked in: xt_statistic xt_realm xt_nat nf_conntrack_netlink arptable_filter arp_tables ip6table_mangle ip6table_raw ip6table_nat ip6t_ah ip6table_filter ip6_tables xt_TCPMSS xt_mark iptable_mangle xt_CT iptable_raw xt_connmark iptable_nat xt_set xt_tcpudp xt_conntrack xt_LOG nf_log_syslog xt_limit xt_addrtype ip_set_hash_netiface ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 crypto_hw_eip93 iptable_filter ip_tables x_tables ath10k_pci ath10k_core ath
kernel: Process accns_status_mo (pid: 18368, threadinfo=2e11cd2a, task=520aea4f, tls=77eb4e64)
kernel: Stack : 7fb51000 81003af4 00000001 81a39fdc 86810000 82ff3e40 82ff3e40 00000000
kernel:         805e2200 82ff3c8d 81ae0000 805e2240 842da640 8102a1c8 81b7b2a0 00000000
kernel:         00000000 00000000 fffffffc 842da600 fffffffc 81bc1000 81bc1000 00008d34
kernel:         86811ce8 8102a92c 82ff3cd5 8100abbc 86b857f8 842da600 86b857f8 842da600
kernel:         7fb72000 82f26240 8fc40048 c067472c 842da600 82f26240 00000021 805e2200
kernel:         ...
kernel: Call Trace:
kernel: [<81198c4c>] kmem_cache_alloc+0x100/0x5d4
kernel: [<8102a1c8>] vm_area_dup+0x20/0x188
kernel: [<8102a778>] dup_mm+0x204/0x464
kernel: [<8102bb8c>] copy_process+0xf44/0x14b4
kernel: [<8102c2f4>] kernel_clone+0x10c/0x3f4
kernel: [<8102c7c8>] sys_fork+0x3c/0x60
kernel: [<81015570>] syscall_common+0x34/0x58
kernel:
kernel: Code: 00000000 8e62001c 02a21021 <8c470000> 41656000 30a50001 000000c0 8f84000c 00004025
kernel:
kernel: ---[ end trace 166eac26610b0a43 ]---

Here is another that starts with a warning:

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 3 PID: 18640 at mm/rmap.c:243 unlink_anon_vmas+0x24c/0x254
kernel: Modules linked in: xt_statistic xt_realm xt_nat nf_conntrack_netlink arptable_filter arp_tables ip6table_mangle ip6table_raw ip6table_nat ip6t_ah ip6table_filter ip6_tables xt_TCPMSS xt_mark iptable_mangle xt_CT iptable_raw xt_connmark iptable_nat xt_set xt_tcpudp xt_conntrack xt_LOG nf_log_syslog xt_limit xt_addrtype ip_set_hash_netiface ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables crypto_hw_eip93 ath10k_pci ath10k_core ath
kernel: CPU: 3 PID: 18640 Comm: surelink_test.s Not tainted 5.15.0 #1
kernel: Stack : 00000000 855944e0 82a0aa80 810823c4 00000000 00000004 00000000 17f3f030
kernel:         8556dcb4 81be3654 81b00000 81b08163 81a4e770 00000001 8556dc58 8049bf00
kernel:         00000000 00000000 81a4e770 8556daf8 fffe67d0 8556db0c 00000000 6e696174
kernel:         20646574 81bec60d 81bec63f 35312e35 81b00000 81a4e770 00000009 00000009
kernel:         00000000 00000100 855621e0 855944e0 00000000 814af81c 000af12f 000a906f
kernel:         ...
kernel: Call Trace:
kernel: [<81008ae8>] show_stack+0x38/0x118
kernel: [<8192c014>] dump_stack_lvl+0x64/0x90
kernel: [<819258e4>] __warn+0xc0/0xe8
kernel: [<81925978>] warn_slowpath_fmt+0x6c/0xd0
kernel: [<8117bfac>] unlink_anon_vmas+0x24c/0x254
kernel: [<8116a8bc>] free_pgtables+0xa4/0x138
kernel: [<81174c5c>] exit_mmap+0x9c/0x1f0
kernel: [<8102a59c>] mmput+0x50/0xe8
kernel: [<81031624>] do_exit+0x3b0/0xa74
kernel: [<81031d6c>] do_group_exit+0x4c/0xb8
kernel: [<81031dec>] __wake_up_parent+0x0/0x14
kernel:
kernel: ---[ end trace 60b85e4a50ed0816 ]---

Here's another slightly different again:

kernel: BUG: Bad page map in process led pte:00303a85 pmd:85b6f000
kernel: page:79d2e80b refcount:1 mapcount:-1 mapping:687c3a7b index:0x28 pfn:0x303
kernel: memcg:80450800
kernel: aops:0x81952030 ino:16f dentry name:"httpd"
kernel: flags: 0x36(referenced|uptodate|lru|active|zone=0)
kernel: raw: 00000036 80014f4c 80014f94 8083ec78 00000028 00000000 fffffffe 00000001
kernel: raw: 80450800
kernel: page dumped because: bad pte
kernel: addr:555a8000 vm_flags:00000075 anon_vma:00000000 mapping:8083ec78 index:28
kernel: file:sh fault:filemap_fault mmap:generic_file_readonly_mmap readpage:squashfs_readpage
kernel: CPU: 2 PID: 3166 Comm: led Not tainted 5.15.0 #1
kernel: Stack : 00000000 80014f6c 81b03ee0 810823c4 00000000 00000004 00000000 1a6638d0
kernel:         859e1c44 81be3654 81b00000 81b08163 81a4e770 00000001 859e1be8 80499c00
kernel:         00000000 00000000 81a4e770 859e1a88 00133bc0 859e1a9c 00000000 63612d30
kernel:         6d6f4320 81bf3d8c 81bf3db2 6c203a6d 81b00000 81a4e770 00000000 8083ec78
kernel:         81b00000 555a8000 85ad0554 80014f6c 00000000 814af81c 00000008 81be0008
kernel:         ...
kernel: Call Trace:
kernel: [<81008ae8>] show_stack+0x38/0x118
kernel: [<8192c014>] dump_stack_lvl+0x64/0x90
kernel: [<81169774>] print_bad_pte+0x190/0x200
kernel: [<8116c1e0>] unmap_page_range+0x6fc/0x8b4
kernel: [<8116c72c>] unmap_vmas+0x6c/0x98
kernel: [<81174c48>] exit_mmap+0x88/0x1f0
kernel: [<8102a59c>] mmput+0x50/0xe8
kernel: [<81031624>] do_exit+0x3b0/0xa74
kernel: [<81031d6c>] do_group_exit+0x4c/0xb8
kernel: [<81031dec>] __wake_up_parent+0x0/0x14

Code paths into VMA routines are by far the most common. All the above dumps go through there.

The difficulty for me in debugging this is that it is very rare. I have no way at this point to reliably reproduce the behavior. It happens maybe once a month. It is also not confined to a single device; the behavior has been observed on a number of individual devices.

The dumps I listed above are from a MediaTek MT7621 (MIPS32r2) based platform. The platform is used as a router type device, so there is lots of network activity, from wired ethernet and USB based cell modems, though I never see any crashes/oopses in driver or hardware code paths.

I have seen this across a variety of kernel versions, certainly from about 5.14 onwards.

I tried enabling quite a few of the kernel's memory debug options, but nothing trips for me when running with those. Unfortunately, running with those options enabled eventually leads to the OOM killer kicking in after a couple of hours.

I am not sure how to figure out what is leading to these.

Regards
Greg