On Thu, 5 Nov 2009, David Miller wrote:

> From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> Date: Wed, 21 Oct 2009 14:24:59 -0400 (EDT)
>
> > The fault_code variable that triggered is in l2, it's 0xfe, the fault
> > address is in l3. Do you have any idea how this could (or couldn't)
> > happen?
>
> The fault_code on sparc64 is a bitmask which should contain only the
> following bit values (some of which are exclusive):
>
> #define FAULT_CODE_WRITE      0x01  /* Write access, implies D-TLB */
> #define FAULT_CODE_DTLB       0x02  /* Miss happened in D-TLB */
> #define FAULT_CODE_ITLB       0x04  /* Miss happened in I-TLB */
> #define FAULT_CODE_WINFIXUP   0x08  /* Miss happened during spill/fill */
> #define FAULT_CODE_BLKCOMMIT  0x10  /* Use blk-commit ASI in copy_page */
>
> 0xfe is an illegal value.
>
> I suspect that once you hit this IDE bug, the IDE controller is
> spamming garbage via DMA all over memory corrupting things.

There is another thing that contradicts this. This BUG() really happened
twice for the same "vmstat" program when I ran it consecutively. On the
same faulting address. After stopping simultaneous I/O and clearing the
cache with "echo 3 >/proc/sys/vm/drop_caches", the machine ran reliably,
including that vmstat command (I rebooted it anyway fearing hidden data
corruption, but there were really no more program failures).

- So, if the controller corrupted kernel code, the machine wouldn't
  recover.
- If the controller corrupted common kernel data, the bug would show on
  all processes or on all "vmstat" processes and it wouldn't go away after
  clearing disk cache.
- If the controller corrupted per-process kernel data, the probability
  that it corrupted two processes in the same way is small.
- Other idea?

Sadly I don't have a copy of the corrupted binary; I wasn't at the console
and I found out about the BUG later :-/

Mikulas