On Wed, 24 Nov 2010 13:21:25 +0000, gary.murray@xxxxxxxxxxxxxxxxxxxxxx wrote:
> The goal of this post is to jog the memory of the MIPS community. We are
> working on an old kernel 2.6.20.19. Unfortunately we are not in a position
> to upgrade our kernel at this time. We understand by doing so our problems
> could potentially disappear. But hopefully with the following description
> and detailed analysis some of you will recognize it.

You should post mail in text format not in HTML, otherwise your mail
will be filtered out.

> Our problem is essentially a register value changes under the running
> thread resulting in kernel oops and panics. The frequency of occurrence
> is completely spurious as are the resulting oops reports i.e. it may take
> 30 Hours for the issue to occur. In trawling the archives we noticed two
> patches dealing with context switching, saving and restoring registers.
> The first patch "Disallow CpU exception in kernel again" was removed and
> superseded. Our kernel version includes these changes, but as they are
> relatively new, maybe additional functionality or bug fixes occurred in
> later releases. Our kernel is configured with preempt enabled.
>
> The PATCH "Retry {save,restore}_fp_context if failed in atomic context".
> http://www.linux-mips.org/archives/linux-mips/2007-04/msg00087.html
>
> The patch first appeared in mainline 2.6.21.1. There were some minor
> tweaks that were cosmetic in nature.
>
> Below is an analysis of a kernel oops with disassembly demonstrating a
> register change. Note our kernel contains this patch, but this patch is
> the most relevant resembling or addressing our problems. The issue could
> very well be HW related.

Your analysis looks correct, but I think the above patches are irrelevant
since they are fixes for FPU registers, not for general purpose registers.

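
For anyone who wants to re-check the decode of the "Code:" words quoted
below, here is a minimal sketch in C. It only splits a 32-bit MIPS
instruction word into its fields and prints the four instruction kinds
that matter here (sll, addu, addiu, lw), using o32 register names. The
helper names mips_decode, mips_print and mips_load_addr are made up for
this illustration; they are not kernel or gdb interfaces.

```c
#include <stdint.h>
#include <stdio.h>

/* o32 names for the 32 general purpose registers. */
static const char *const mips_reg[32] = {
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8", "t9", "k0", "k1", "gp", "sp", "s8", "ra",
};

/* Raw fields of one 32-bit MIPS instruction word. */
struct mips_insn {
    unsigned op, rs, rt, rd, sa, funct;
    int16_t imm;    /* signed 16-bit immediate (I-type view) */
};

/* Hypothetical helper: split an instruction word into its bit fields. */
struct mips_insn mips_decode(uint32_t w)
{
    struct mips_insn i;

    i.op    = (w >> 26) & 0x3f;
    i.rs    = (w >> 21) & 0x1f;
    i.rt    = (w >> 16) & 0x1f;
    i.rd    = (w >> 11) & 0x1f;
    i.sa    = (w >> 6) & 0x1f;
    i.funct = w & 0x3f;
    i.imm   = (int16_t)(w & 0xffff);
    return i;
}

/* Hypothetical helper: print sll, addu, addiu or lw; nothing else. */
void mips_print(uint32_t w)
{
    struct mips_insn i = mips_decode(w);

    if (i.op == 0 && i.funct == 0x00)           /* sll rd,rt,sa */
        printf("%08x => sll %s,%s,0x%x\n", (unsigned)w,
               mips_reg[i.rd], mips_reg[i.rt], i.sa);
    else if (i.op == 0 && i.funct == 0x21)      /* addu rd,rs,rt */
        printf("%08x => addu %s,%s,%s\n", (unsigned)w,
               mips_reg[i.rd], mips_reg[i.rs], mips_reg[i.rt]);
    else if (i.op == 0x09)                      /* addiu rt,rs,imm */
        printf("%08x => addiu %s,%s,%d\n", (unsigned)w,
               mips_reg[i.rt], mips_reg[i.rs], i.imm);
    else if (i.op == 0x23)                      /* lw rt,imm(rs) */
        printf("%08x => lw %s,%d(%s)\n", (unsigned)w,
               mips_reg[i.rt], i.imm, mips_reg[i.rs]);
    else
        printf("%08x => (not handled by this sketch)\n", (unsigned)w);
}

/* Address touched by a load: base register plus sign-extended offset. */
uint32_t mips_load_addr(uint32_t base, uint32_t w)
{
    return base + (uint32_t)(int32_t)mips_decode(w).imm;
}
```

Calling mips_print(0x000e6880) prints "000e6880 => sll t5,t6,0x2", which
matches the gdb disassembly at pipe_write+188. Calling
mips_load_addr(0x03c9dab0, 0x8e55000c) with the s2 value from the register
dump gives 0x03c9dabc, the reported BadVA.
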
> CPU 0 Unable to handle kernel paging request at virtual address 03c9dabc,
> epc == 800af9c8, ra == 800af974
> Oops[#1]:
> Cpu 0
> $ 0   : 00000000 00000001 00000001 00020000
> $ 4   : 00000001 82ab0228 00000000 834c31f8
> $ 8   : 00000000 00000000 00000001 00000000    t0, t1, t2, t3
> $12   : 03c9da80 80393a80 00000000 00000000    t4, t5, t6, t7
> $16   : 8390a000 00000019 03c9dab0 00000019    s0, s1, s2, s3
> $20   : 00000019 00000000 00000001 828f3e18
> $24   : 00000000 828a3d30
> $28   : 828f2000 828f3d78 80486de0 800af974
> Hi    : 00000fff
> Lo    : 97248a23
> epc   : 800af9c8 pipe_write+0xc8/0x840     Tainted: P
> ra    : 800af974 pipe_write+0x74/0x840
> Status: 1100ff03    KERNEL EXL IE
> Cause : 00800008
> BadVA : 03c9dabc
> PrId  : 00019641
> Modules linked in: ipt_REJECT ast_read_timer lakefxo zaptel drv_dect
> drv_duslic xhfc mISDN_core drv_vinetic drv80c823 drv_vmmc drv_tapi
> drv_fpga_core drv_mps cls_rsvp6 cls_rsvp xt_LED nf_nat_h323
> nf_conntrack_h323 nf_nat_pptp nf_conntrack_pptp nf_nat_proto_gre
> nf_conntrack_proto_gre nf_nat_sip nf_conntrack_sip nf_nat_tftp
> nf_conntrack_tftp nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp
> ipt_LOG ipt_TCPMSS ipt_MASQUERADE xt_limit xt_state xt_tcpudp iptable_nat
> nf_conntrack_ipv4 nf_nat nfnetlink iptable_filter ip_tables x_tables
> ath_rate_sample ath_pci ath_hal(P) wlan_xauth wlan_wep wlan_tkip wlan_ccmp
> wlan_acl wlan_scan_sta wlan_scan_ap wlan
> Process pabx (pid: 5977, threadinfo=828f2000, task=834c31f8)
> Stack : 00000001 00000000 00000004 00000000 00000001 828f3dd0 7ff047e8 00000000
>         00000009 7ff04868 828f3dd0 800bc544 8183ac00 828f3db8 828f3f20 80073db8
>         828f3dd0 828f3dd4 00000000 00000000 828f3de0 828f3de4 00000100 825c6d90
>         828f3e20 80488e40 80486de0 fffffffd 828f3f18 ffffffff 00000009 00000017
>         00000000 800a4de8 80052dbc 800b174c 1002678c 00000001 00000000 00000000
>         ...
> Call Trace:
> [<800af9c8>] pipe_write+0xc8/0x840
> [<800a4de8>] do_sync_write+0xd0/0x240
> [<800a51c8>] vfs_write+0x270/0x290
> [<800a52dc>] sys_write+0x54/0xa0
> [<80012c04>] stack_done+0x20/0x3c
>
>
> Code: 000e6880 01b06021 25920034 <8e55000c> 8e4a0004 8e4b0008 8ea90000
> 014b2021 152000d9
>
>
> (gdb) disassemble pipe_write
> Dump of assembler code for function pipe_write:
> .
> .
> 0x00000928 <pipe_write+184>: addu t6,t7,s2
> 0x0000092c <pipe_write+188>: sll t5,t6,0x2
> 0x00000930 <pipe_write+192>: addu t4,t5,s0
> 0x00000934 <pipe_write+196>: addiu s2,t4,52
> 0x00000938 <pipe_write+200>: lw s5,12(s2)      FAILING INSTRUCTION
> 0x0000093c <pipe_write+204>: lw t2,4(s2)
> 0x00000940 <pipe_write+208>: lw t3,8(s2)
> .
> .
> End of assembler dump.
>
>
> 000e6880 => sll t5,t6,0x2
> 01b06021 => addu t4,t5,s0
> 25920034 => addiu s2,t4,52
> 8e55000c => lw s5,12(s2)
>
>
> The faulting instruction is marked above. From this we see that t5($13)
> contains the wrong value. It should contain the value of t6($14) left
> shifted 2 bits, that is to say zero. Instead it contains the value
> 80393a80.
>
> At least one other register is not as expected. a2($6) should not contain
> the value 0, otherwise the branch at <pipe_write+156> would have been
> taken. Also, if the instruction at <pipe_write+144> was executed, then
> a3($7) should contain either 0 or 1, not an address. Likewise, t0($8)
> should contain the value 1, given the value of s3($19).
>
>
>
> (gdb) list *pipe_write+200
> 0x938 is in pipe_write (fs/pipe.c:368).
> 363     chars = total_len & (PAGE_SIZE-1); /* size of the last buffer */
> 364     if (pipe->nrbufs && chars != 0) {
> 365         int lastbuf = (pipe->curbuf + pipe->nrbufs - 1) &
> 366                 (PIPE_BUFFERS-1);
> 367         struct pipe_buffer *buf = pipe->bufs + lastbuf;
> 368 FAULTING C CODE   const struct pipe_buf_operations *ops = buf->ops;
> 369         int offset = buf->offset + buf->len;
> 370
> 371         if (ops->can_merge && offset + chars <= PAGE_SIZE) {
> 372             int error, atomic = 1;
>
>
>
> Any idea out there??

No idea for now...

---
Atsushi Nemoto