On Wed, Oct 01, 2003 at 04:50:23PM -0700, Jun Sun wrote: > On Fri, Sep 12, 2003 at 11:04:16AM -0700, Craig Mautner wrote: > > We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in > > schedule(): > > > > Unable to handle kernel paging request at virtual address 00000000, epc == > > 800153c0, ra == 800153c0 > > $0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f > > $8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002 > > $16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000 > > $24: 82af1dc2 00000002 82af0000 82af1ef8 82af1ef8 800153c0 > > epc : 800153c0 Not tainted > > > > The code is: > > > > { > > struct mm_struct *mm = next->mm; > > struct mm_struct *oldmm = prev->active_mm; > > if (!mm) { > > if (next->active_mm) BUG(); <- this is where we crash > > next->active_mm = oldmm; > > atomic_inc(&oldmm->mm_count); > > enter_lazy_tlb(oldmm, next, this_cpu); > > } > > . > > . > > . > > > > This seems to happen in our case when 'next' points to 'kswapd' although we > > think it could happen when switching to any kernel task (i.e. those tasks > > with mm==NULL). > > > > We think the culprit is that we are taking an interrupt and rescheduling > > while at a vulnerable point in 'schedule()'. Interrupts are enabled in line > > 743. If we get an interrupt any time after line 785: > > > > next->active_mm = oldmm; > > > > but before line 806 > > > > __schedule_tail() > > > > completes the swap, the interrupt can force 'schedule()' to be reentered via > > 'ret_from_intr()'. > > > > If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current' will > > not have been set to point to 'kswapd'. The next time 'schedule()' tries to > > switch to 'kswapd', 'next' points to 'kswapd', and > > > > next->mm == NULL > > next->active_mm != NULL > > > > which is detected as an invalid state, so we hit the BUG. > > > > Some questions: > > Are we looking at this correctly? > > Has anyone ever seen this before? > > Is there a published fix? > > > > Thanks, > > > > -Craig > > > > This is an known problem. Please try the attached patch. > > On R5432 CPU, there is also an hardware bug which can cause the same > problem. Please double-check vec3_generic to see if workaround is > at the beginning of the handler. > > BTW, 2.4.17 is an old kernel. You really need to upgrade. By the way, in 2.6 the include of <asm/war.h> has vanished from genex.S. If you want the workaround to be compiled, then you need to re-add that. -- Daniel Jacobowitz MontaVista Software Debian GNU/Linux Developer