Hi Greg, On Fri, Jul 8, 2022 at 6:40 AM Greg Ungerer <gregungerer@xxxxxxxxxxxxxx> wrote: > > Hi All, > > I am trying to track down what may be causing a very rare and unpredictable > kernel oops dump on Mediatek MT7621 SoC based platforms. > > I mentioned this initially in this thread, > https://lore.kernel.org/all/3c4a0ab9-bc54-4584-bb27-d6045096335b@xxxxxxxxxx/T/ > but that was more about the memory debug warning than the underlying issue I > am trying to track down. > > Essentially what I am seeing is a variety of unpredictable oops and sometimes > warnings mostly within the memory sub-system. Here is a typical one: > > kernel: Unhandled kernel unaligned access[#1]: > kernel: CPU: 0 PID: 12943 Comm: sh Not tainted 5.15.0 #1 > kernel: $ 0 : 00000000 00000001 00000011 81c10000 > kernel: $ 4 : 80402e80 00000cc0 8117c39c 81c22f48 > kernel: $ 8 : 02e808a6 81b043c8 00000000 81be1000 > kernel: $12 : 81be1000 000083e7 00000000 81be1000 > kernel: $16 : 81b00000 00000000 81b043c8 80402e80 > kernel: $20 : 81c10000 00000001 8117c39c 00000cc0 > kernel: $24 : 00000000 77da5ce8 > kernel: $28 : 835be000 835bfcd0 00000081 8119bcb8 > kernel: Hi : 0000000f > kernel: Lo : 0000003c > kernel: epc : 8119bd54 kmem_cache_alloc+0xe4/0x5b8 > kernel: ra : 8119bcb8 kmem_cache_alloc+0x48/0x5b8 > kernel: Status: 11000403^IKERNEL EXL IE > kernel: Cause : 40800010 (ExcCode 04) > kernel: BadVA : 00000011 > kernel: PrId : 0001992f (MIPS 1004Kc) > kernel: Modules linked in: xt_statistic xt_realm xt_nat nf_conntrack_netlink arptable_filter arp_tables ip6table_mangle ip6table_raw ip6table_nat ip6t_ah ip6table_filter ip6_tables xt_TCPMSS xt_mark iptable_mangle xt_CT iptable_raw xt_connmark iptable_nat xt_set xt_tcpudp xt_conntrack xt_LOG nf_log_syslog xt_limit xt_addrtype ip_set_hash_netiface ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables crypto_hw_eip93 ath10k_pci ath10k_core ath > kernel: Process sh (pid: 12943, threadinfo=24898ed1, task=5506fe17, tls=77e28e64) > kernel: Stack : 00000000 81b043c8 00000001 00000000 81bd38b0 00000000 82be00c0 81b00000 > kernel: 82be00c0 805eb600 81c10000 00000001 82be00c0 8117c39c 00000025 00000001 > kernel: 00000081 00000000 81b00000 00000078 00000000 00000081 81b00000 82be00c0 > kernel: 81b70000 00000000 00000001 8116f190 82be03c0 5e0db60a 00000cc0 00000000 > kernel: 00000000 81b00000 81b00000 556a1074 82be00c0 00000cc0 0007ffff 7ffff000 > kernel: ... > kernel: Call Trace: > kernel: [<8119bd54>] kmem_cache_alloc+0xe4/0x5b8 > kernel: [<8117c39c>] __anon_vma_prepare+0x3c/0x1a4 > kernel: [<8116f190>] handle_mm_fault+0x4e8/0xea4 > kernel: [<81167638>] __get_user_pages.part.94+0x154/0x338 > kernel: [<81167e64>] __get_user_pages_remote+0x118/0x3ac > kernel: [<811ba314>] get_arg_page+0x5c/0x108 > kernel: [<811bacec>] copy_string_kernel+0x104/0x248 > kernel: [<811bbce0>] do_execveat_common+0x148/0x1ec > kernel: [<811bcc38>] sys_execve+0x34/0x48 > kernel: [<81015570>] syscall_common+0x34/0x58 > kernel: > kernel: Code: 02c03025 8e62001c 02a21021 <8c460000> 41656000 30a50001 000000c0 8f84000c 00003825 > kernel: > > I see a fail like this ending in kmem_cache_alloc() a bit, but not always with > the same BadVA, or the same call path. Here is another: > > kernel: CPU 0 Unable to handle kernel paging request at virtual address c200002c, epc == 81198c4c, ra == 81198b94 > kernel: Oops[#1]: > kernel: CPU: 0 PID: 18368 Comm: accns_status_mo Not tainted 5.14.0 #1 > kernel: $ 0 : 00000000 00000001 c200002c 00049000 > kernel: $ 4 : 80412200 00000cc0 2f014d31 81c06f98 > kernel: $ 8 : 00100173 81ae83cc 00000000 81bc1000 > kernel: $12 : 81bc1000 00008d34 00000000 81bc1000 > kernel: $16 : 81af0000 00000000 81ae83cc 80412200 > kernel: $20 : 81bf0000 c2000000 8102a1c8 00000cc0 > kernel: $24 : 00000000 8116e93c > kernel: $28 : 86810000 86811c38 82ff3c8d 81198b94 > kernel: Hi : 0089543b > kernel: Lo : b55e0000 > kernel: epc : 81198c4c kmem_cache_alloc+0x100/0x5d4 > kernel: ra : 81198b94 kmem_cache_alloc+0x48/0x5d4 > kernel: Status: 11000403^IKERNEL EXL IE > kernel: Cause : 40800008 (ExcCode 02) > kernel: BadVA : c200002c > kernel: PrId : 0001992f (MIPS 1004Kc) > kernel: Modules linked in: xt_statistic xt_realm xt_nat nf_conntrack_netlink arptable_filter arp_tables ip6table_mangle ip6table_raw ip6table_nat ip6t_ah ip6table_filter ip6_tables xt_TCPMSS xt_mark iptable_mangle xt_CT iptable_raw xt_connmark iptable_nat xt_set xt_tcpudp xt_conntrack xt_LOG nf_log_syslog xt_limit xt_addrtype ip_set_hash_netiface ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 crypto_hw_eip93 iptable_filter ip_tables x_tables ath10k_pci ath10k_core ath > kernel: Process accns_status_mo (pid: 18368, threadinfo=2e11cd2a, task=520aea4f, tls=77eb4e64) > kernel: Stack : 7fb51000 81003af4 00000001 81a39fdc 86810000 82ff3e40 82ff3e40 00000000 > kernel: 805e2200 82ff3c8d 81ae0000 805e2240 842da640 8102a1c8 81b7b2a0 00000000 > kernel: 00000000 00000000 fffffffc 842da600 fffffffc 81bc1000 81bc1000 00008d34 > kernel: 86811ce8 8102a92c 82ff3cd5 8100abbc 86b857f8 842da600 86b857f8 842da600 > kernel: 7fb72000 82f26240 8fc40048 c067472c 842da600 82f26240 00000021 805e2200 > kernel: ... > kernel: Call Trace: > kernel: [<81198c4c>] kmem_cache_alloc+0x100/0x5d4 > kernel: [<8102a1c8>] vm_area_dup+0x20/0x188 > kernel: [<8102a778>] dup_mm+0x204/0x464 > kernel: [<8102bb8c>] copy_process+0xf44/0x14b4 > kernel: [<8102c2f4>] kernel_clone+0x10c/0x3f4 > kernel: [<8102c7c8>] sys_fork+0x3c/0x60 > kernel: [<81015570>] syscall_common+0x34/0x58 > kernel: > kernel: Code: 00000000 8e62001c 02a21021 <8c470000> 41656000 30a50001 000000c0 8f84000c 00004025 > kernel: > kernel: ---[ end trace 166eac26610b0a43 ]--- > > Here is another that starts with a warning: > > kernel: ------------[ cut here ]------------ > kernel: WARNING: CPU: 3 PID: 18640 at mm/rmap.c:243 unlink_anon_vmas+0x24c/0x254 > kernel: Modules linked in: xt_statistic xt_realm xt_nat nf_conntrack_netlink arptable_filter arp_tables ip6table_mangle ip6table_raw ip6table_nat ip6t_ah ip6table_filter ip6_tables xt_TCPMSS xt_mark iptable_mangle xt_CT iptable_raw xt_connmark iptable_nat xt_set xt_tcpudp xt_conntrack xt_LOG nf_log_syslog xt_limit xt_addrtype ip_set_hash_netiface ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables crypto_hw_eip93 ath10k_pci ath10k_core ath > kernel: CPU: 3 PID: 18640 Comm: surelink_test.s Not tainted 5.15.0 #1 > kernel: Stack : 00000000 855944e0 82a0aa80 810823c4 00000000 00000004 00000000 17f3f030 > kernel: 8556dcb4 81be3654 81b00000 81b08163 81a4e770 00000001 8556dc58 8049bf00 > kernel: 00000000 00000000 81a4e770 8556daf8 fffe67d0 8556db0c 00000000 6e696174 > kernel: 20646574 81bec60d 81bec63f 35312e35 81b00000 81a4e770 00000009 00000009 > kernel: 00000000 00000100 855621e0 855944e0 00000000 814af81c 000af12f 000a906f > kernel: ... > kernel: Call Trace: > kernel: [<81008ae8>] show_stack+0x38/0x118 > kernel: [<8192c014>] dump_stack_lvl+0x64/0x90 > kernel: [<819258e4>] __warn+0xc0/0xe8 > kernel: [<81925978>] warn_slowpath_fmt+0x6c/0xd0 > kernel: [<8117bfac>] unlink_anon_vmas+0x24c/0x254 > kernel: [<8116a8bc>] free_pgtables+0xa4/0x138 > kernel: [<81174c5c>] exit_mmap+0x9c/0x1f0 > kernel: [<8102a59c>] mmput+0x50/0xe8 > kernel: [<81031624>] do_exit+0x3b0/0xa74 > kernel: [<81031d6c>] do_group_exit+0x4c/0xb8 > kernel: [<81031dec>] __wake_up_parent+0x0/0x14 > kernel: > kernel: ---[ end trace 60b85e4a50ed0816 ]--- > > Here's another slightly different again: > > kernel: BUG: Bad page map in process led pte:00303a85 pmd:85b6f000 > kernel: page:79d2e80b refcount:1 mapcount:-1 mapping:687c3a7b index:0x28 pfn:0x303 > kernel: memcg:80450800 > kernel: aops:0x81952030 ino:16f dentry name:"httpd" > kernel: flags: 0x36(referenced|uptodate|lru|active|zone=0) > kernel: raw: 00000036 80014f4c 80014f94 8083ec78 00000028 00000000 fffffffe 00000001 > kernel: raw: 80450800 > kernel: page dumped because: bad pte > kernel: addr:555a8000 vm_flags:00000075 anon_vma:00000000 mapping:8083ec78 index:28 > kernel: file:sh fault:filemap_fault mmap:generic_file_readonly_mmap readpage:squashfs_readpage > kernel: CPU: 2 PID: 3166 Comm: led Not tainted 5.15.0 #1 > kernel: Stack : 00000000 80014f6c 81b03ee0 810823c4 00000000 00000004 00000000 1a6638d0 > kernel: 859e1c44 81be3654 81b00000 81b08163 81a4e770 00000001 859e1be8 80499c00 > kernel: 00000000 00000000 81a4e770 859e1a88 00133bc0 859e1a9c 00000000 63612d30 > kernel: 6d6f4320 81bf3d8c 81bf3db2 6c203a6d 81b00000 81a4e770 00000000 8083ec78 > kernel: 81b00000 555a8000 85ad0554 80014f6c 00000000 814af81c 00000008 81be0008 > kernel: ... > kernel: Call Trace: > kernel: [<81008ae8>] show_stack+0x38/0x118 > kernel: [<8192c014>] dump_stack_lvl+0x64/0x90 > kernel: [<81169774>] print_bad_pte+0x190/0x200 > kernel: [<8116c1e0>] unmap_page_range+0x6fc/0x8b4 > kernel: [<8116c72c>] unmap_vmas+0x6c/0x98 > kernel: [<81174c48>] exit_mmap+0x88/0x1f0 > kernel: [<8102a59c>] mmput+0x50/0xe8 > kernel: [<81031624>] do_exit+0x3b0/0xa74 > kernel: [<81031d6c>] do_group_exit+0x4c/0xb8 > kernel: [<81031dec>] __wake_up_parent+0x0/0x14 > > > Code paths into VMA routines are by far the most common. All the above dumps > go through there. > > The difficulty for me debugging here is that this is very rare. I have no way > at this point to reliably reproduce the behavior. Maybe once a month it happens. > Oh and this is not just a single device, the behavior has been observed on a > number of individual devices, not consistently just one device. > > The dumps I listed above are on a MediaTek MT7621 (MIPS32r2) based platform. > The platform is used as a router type device, so lots of network activity, from > wired ethernet and USB based cell modems. Though I never see any crashes/oops > on driver or hardware code paths. > > I have seen this across a variety of kernel versions. Certainly from about > 5.14 onwards. > > I tried enabling quite a few of the kernel's memory debug options, but nothing > trips for me running those. Unfortunately running with those options enabled > eventually leads to the OOM killer kicking in after a couple of hours. > > Not sure how to figure out what is leading to these? This is definitely not happening to me in my Mt7621 related boards. I have two boards: - GNUBee PC-1 - HiLink HLK-7621A evaluation board I am currently using kernel 5.15 from [1] with this configuration for GnuBee [2] and openWrt master branch [3] with kernel 5.15 and openWrt patches [4] for the HiLink board. Let me know if you want me to check something on my side. > > Regards > Greg Best regards, Sergio Paracuellos [1]: https://github.com/neilbrown/linux/tree/gnubee/v5.15 [2]: https://github.com/neilbrown/gnubee-tools/blob/master/kern_config/gbpc1u-5.15 [3]: https://github.com/openwrt/openwrt [4]: https://github.com/openwrt/openwrt/tree/master/target/linux/ramips/patches-5.15 >