I am facing an intermittent Oops on Cavium-Octeon CN56xx. The oops is related to networking, and typically includes TIPC (virtually all inter-node comms is using TIPC). It takes about 2-3 hours of running a directed TIPC client/server test to invoke the oops. The signature varies, but typically includes either sock_sendmsg or sock_poll.

Here's an example captured from syslog with sock_poll.

Feb 5 11:18:42 mpc_1_1 kernel: CPU 2 Unable to handle kernel paging request at virtual address ffffffffc016f970, epc == ffffffff813e8ec4, ra == ffffffff81235658
Feb 5 11:18:44 mpc_1_1 kernel: Oops[#1]:
Feb 5 11:18:44 mpc_1_1 kernel: Cpu 3
Feb 5 11:18:44 mpc_1_1 kernel: $ 0 : 0000000000000000 0000000000000001 ffffffffc016f930 a8000000e9e2aa00
Feb 5 11:18:44 mpc_1_1 kernel: $ 4 : a8000000e92f8600 0000000000000000 0000000000000000 00000000000003e8
Feb 5 11:18:44 mpc_1_1 kernel: $ 8 : 0000000000416248 00000000004186e8 000000007fdba9e0 000000000040243c
Feb 5 11:18:44 mpc_1_1 kernel: $12 : 0000000000000000 ffffffffc0000008 ffffffff81235400 ffffffff89a0a600
Feb 5 11:18:44 mpc_1_1 kernel: $16 : a8000000bc9ec700 a8000000bc9ec718 0000000000000000 a8000000e92f8000
Feb 5 11:18:44 mpc_1_1 kernel: $20 : 0000000000000001 000000007fdba9b0 a8000000e92f8070 0000000000000000
Feb 5 11:18:44 mpc_1_1 kernel: $24 : 0000000000000400 0000000038fbd038
Feb 5 11:18:44 mpc_1_1 kernel: $28 : a8000000b9d64000 a8000000b9d67dc0 a8000000e92f9300 ffffffff81235658
Feb 5 11:18:44 mpc_1_1 kernel: Hi : 0000000000007d7f
Feb 5 11:18:44 mpc_1_1 kernel: Lo : df3b645a1cac9d39
Feb 5 11:18:44 mpc_1_1 kernel: epc : ffffffff813e8ec4 sock_poll+0xc/0x18
Feb 5 11:18:44 mpc_1_1 kernel: Not tainted
Feb 5 11:18:44 mpc_1_1 kernel: ra : ffffffff81235658 SyS_epoll_wait+0x258/0x560
Feb 5 11:18:44 mpc_1_1 kernel: Status: 1000cce3 KX SX UX KERNEL EXL IE
Feb 5 11:18:44 mpc_1_1 kernel: Cause : 00800008
Feb 5 11:18:44 mpc_1_1 kernel: BadVA : ffffffffc016f970
Feb 5 11:18:44 mpc_1_1 kernel: PrId : 000d0409 (Cavium Octeon)
Feb 5 11:18:44 mpc_1_1 kernel: Modules linked in: usbcore bonding i2c_dev x_tables ip6_tables ip_tables ipv6 libcrc32c sctp spioc binfmt_misc jazz_mod iptable_filter tunnel4 sit ipmi_msghandler ipmi_serial ipmi_serial_terminal_mode ipmi_devintf ipmi_watchdog tipc dti si5326 mt29f
Feb 5 11:18:44 mpc_1_1 kernel: Process tipcServer_mpc (pid: 8226, threadinfo=a8000000b9d64000, task=a8000000bca5e900, tls=000000002ad009a0)
Feb 5 11:18:44 mpc_1_1 kernel: Stack : a8000000b9d67dc0 a8000000b9d67dc0 0000000000000001 0000000000000000
Feb 5 11:18:44 mpc_1_1 kernel: 0000000000000081 0000000000416234 0000000000000001 000000000041e060
Feb 5 11:18:44 mpc_1_1 kernel: a8000000e92f8010 a8000000e92f8030 a8000000e92f8050 a8000000e92f8060
Feb 5 11:18:44 mpc_1_1 kernel: 00000000000000fa a8000000e92f8040 0000000000404db8 0000000000401220
Feb 5 11:18:44 mpc_1_1 kernel: 00000000004b0000 00000000004e8450 00000000004e05f8 00000000004e8450
Feb 5 11:18:44 mpc_1_1 kernel: 0000000000000000 00000000004ed948 000000007fdba968 ffffffff8114732c
Feb 5 11:18:44 mpc_1_1 kernel: 0000000000000000 ffffffff81103be4 000000000000109a 000000002acf9530
Feb 5 11:18:44 mpc_1_1 kernel: 0000000000000003 000000007fdba9b0 0000000000000001 00000000000003e8
Feb 5 11:18:44 mpc_1_1 kernel: 0000000000000001 00000000203d2025 0000000025252525 ffffffff81010100
Feb 5 11:18:44 mpc_1_1 kernel: 0000000000000000 0000000000000010 ffffffff813eaf18 ffffffff89a0a600
Feb 5 11:18:44 mpc_1_1 kernel: ...
Feb 5 11:18:44 mpc_1_1 kernel: Call Trace:
Feb 5 11:18:44 mpc_1_1 kernel: [<ffffffff813e8ec4>] sock_poll+0xc/0x18
Feb 5 11:18:44 mpc_1_1 kernel: [<ffffffff81235658>] SyS_epoll_wait+0x258/0x560
Feb 5 11:18:44 mpc_1_1 kernel: [<ffffffff8114732c>] handle_sys+0x12c/0x148
Feb 5 11:18:44 mpc_1_1 kernel:
Feb 5 11:18:44 mpc_1_1 kernel:
Feb 5 11:18:44 mpc_1_1 kernel: Code: dc830098 00a0302d dc620010 <dc590040> 03200008 0060282d dc830098 00a0302d dc620010
Feb 5 11:18:44 mpc_1_1 kernel: TIPC: Resetting link <1.1.11:bond0-1.1.101:bond0>, requested by peer
Feb 5 11:18:44 mpc_1_1 kernel: TIPC: Lost link <1.1.11:bond0-1.1.101:bond0> on network plane A
Feb 5 11:18:44 mpc_1_1 kernel: TIPC: Lost contact with <1.1.101>
Feb 5 11:18:44 mpc_1_1 kernel: TIPC: Established link <1.1.11:bond0-1.1.101:bond0> on network plane A

Looking at the disassembly of sock_poll, it can be seen that the error occurs when dereferencing ops->poll to load the function pointer into register t9.

0000000000000048 <sock_poll>:
        struct socket *sock;

        /*
         * We can't return errors to poll, so it's either yes or no.
         */
        sock = file->private_data;
  48:   dc830098        ld      v1,152(a0)
        return sock->ops->poll(file, sock, wait);
  4c:   00a0302d        move    a2,a1
  50:   dc620010        ld      v0,16(v1)
  54:   dc590040        ld      t9,64(v0)
  58:   03200008        jr      t9
  5c:   0060282d        move    a1,v1

The register dump corroborates this: BadVA (ffffffffc016f970) equals 64(v0), with v0 ($2) holding ffffffffc016f930.

I originally thought this address _was_ bad, because all of the kernel code addresses are in the range ffffffff81xxxxxx. Then it occurred to me that modules might be loaded at a different address, so I checked a live system:

/tmp> grep ffffffffc016f930 /proc/kallsyms
ffffffffc016f930 r msg_ops [tipc]
/tmp>

Based on this, it seems that sock->ops is valid and correct, and my original assumption about a corrupt address was wrong. I'm left to conclude that the virtual address is correct, but that the page mapping operation is failing for some other reason. Strangely, the mapping fails only intermittently/temporarily.
I say intermittently because only one of the many processes using TIPC oopses, while the others continue unaffected. This can be seen in the last 4 lines of the syslog above, where the TIPC link goes Resetting -> Lost -> Established.

Some other (less important?) details: TIPC is the only protocol loaded as a module. The kernel is 64-bit, but userspace is O32 due to some old 3rd-party libraries. Typically, the processes running TIPC have their core mask set to 0x000F, to limit them to cores 0-3. I'm repeating the tests with all processes running only on core 0 to see if SMP might be a factor.

What might be going on here? Could a page mapping fail even if the VA has a physical mapping in the page table? Could the TIPC module be at fault (and if so, how)? What else can I look at to track down what might be happening?

Best regards,
-Erich