On Wed, Oct 12, 2011 at 12:14:34AM +0530, K.Prasad wrote:
> The MC4_CTL_MASK doesn't appear to be defined in the kernel. Looking at
> http://support.amd.com/us/Processor_TechDocs/26094.PDF, Page 196, it
> states that "This register is typically programmed by BIOS and not by
> the Kernel software".

Oh, this is K8 BKDG, thus pretty old. For AMD docs, you could use
developer.amd.com, and more specifically
http://developer.amd.com/documentation/Pages/default.aspx

So if we look at the F10h manual:
http://support.amd.com/us/Processor_TechDocs/31116.pdf there's this
section "2.12.1.2.1 Machine Check Error Logging and Reporting" on p. 167
which explains all the modalities around switching MCE on/off. And if
you clear CR4.MCE, the machine would shut down on a fatal MCE as an
additional precaution when running software which doesn't support MCE
(fully) but you still don't want to corrupt your data:

"If error reporting is enabled but CR4.MCE is disabled, a reportable
error will cause the system to enter shutdown."

Thus clearing the MCi_CTL_MASK bit should help you.

> So, in any case we may not be able to disable machine-check exceptions
> (MCEs) only within the context of kexec'ed kernel. Let me know if I've
> missed something here.

I'm not sure it is advisable to completely disable MCA for the whole
duration of the image dumping, especially on a system which has already
booted into the second kernel due to an MCE.

> > But, regardless, according to Vivek, the "makedumpfile" tool should be
> > able to jump over poisoned pages and you don't need all the hoopla above
> > at all, right?
> >
>
> In short, the answer is yes. We could add a new string, say
> "CRASH_REASON=PANIC_MCE" to VMCOREINFO elf-note which can be parsed by
> 'makedumpfile' and get away without adding the new NT_NOCOREDUMP
> elf-note. Parsing through the log_buf to lookout for panic string from
> inside 'makedumpfile' appears to be a clumsy solution though.

Why, 'makedumpfile' reportedly supports some dmesg parsing already - why
would you need additional functionality when it can be done with
in-house means already? Maybe Vivek should comment on whether this makes
sense but I'm basically reiterating what he said.

> i) Scenario1: System crashes because of a fatal MCE
>
> Proposed Solution: Add a new string in the VMCOREINFO elf-note from
> within the MCE panic path to indicate cause of crash. 'makedumpfile'
> recognises this string to collect a slimdump instead of the normal dump.

See above.

> ii) Scenario2: System with PG_hwpoison (or landmine!) pages crashes because
> of a software bug. In this case, kexec kernel would normally reboot because
> of reading the PG_poison page. I'll soon get a new version of the patchset
> implementing this.
>
> Solution: Maintain a linked list of PFNs when the corresponding 'struct page'
> has been marked PG_hwpoison. We could export/put this list to use in
> quite a few ways.

Let me stop you right there: again, according to Vivek:

http://marc.info/?l=kexec&m=131805679405076&w=2

makedumpfile can iterate over the struct page arrays and skip over
PG_hwpoison pages. I think this should be enough of functionality....

-- 
Regards/Gruss,
Boris.
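
For illustration only, a minimal sketch of the "iterate over the struct
page arrays and skip over PG_hwpoison pages" idea. This is not
makedumpfile code: struct mock_page, MOCK_PG_HWPOISON (including its bit
position) and collect_dumpable_pfns() are hypothetical names made up
here, and a real tool would take the flag value and the page array from
the crashed kernel's dump rather than from an in-memory array.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's PG_hwpoison flag bit. The real
 * bit position depends on kernel version and configuration.
 */
#define MOCK_PG_HWPOISON (1UL << 5)

/* Hypothetical, stripped-down stand-in for 'struct page'. */
struct mock_page {
	unsigned long flags;
};

/*
 * Walk a page array and record the PFNs that are safe to dump, skipping
 * every page whose flags mark it as poisoned. Returns the number of PFNs
 * written to 'out', which must have room for 'nr_pages' entries.
 */
static size_t collect_dumpable_pfns(const struct mock_page *pages,
				    size_t nr_pages, uint64_t start_pfn,
				    uint64_t *out)
{
	size_t i, n = 0;

	for (i = 0; i < nr_pages; i++) {
		if (pages[i].flags & MOCK_PG_HWPOISON)
			continue;	/* do not queue the poisoned page */
		out[n++] = start_pfn + i;
	}
	return n;
}
```

The point of the sketch is that the decision needs only the per-page
flags, so no separate linked list of poisoned PFNs has to be maintained
or exported for this purpose.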