On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
> "K.Prasad" <prasad at linux.vnet.ibm.com> writes:
>
> > PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information
> >
> > Fatal machine check exceptions (caused due to hardware memory errors) will now
> > result in a 'slim' coredump that captures vital information about the MCE. This
> > patch introduces a new panic flag, and new parameters to *panic functions
> > that can capture more information pertaining to the cause of crash.
> >
> > Enable a new elf-notes section to store additional information about the crash.
> > For MCE, enable a new notes section that captures relevant register status
> > (struct mce) to be later read during coredump analysis.
>
> There may be a reason to pass everything struct mce through 5 layers of
> code but right now it looks like it just makes everything uglier to no
> real purpose.

We could have stopped with just a blank elf-note of type NT_MCE
indicating an MCE-triggered panic, but dumping 'struct mce' in it helps
gather more useful information about the error - especially the memory
address that experienced the unrecoverable error (stored in mce->addr).

Patch 6/6, for the 'crash' tool, enables decoding of 'struct mce' to
show this information. The sample log in patch 0/6 didn't show these
benefits because the 'mce-inject' tool used to soft-inject these errors
doesn't populate all registers with valid contents.

The idea was that when the physical address held in mce->addr is shown
while decoding the coredump, the corresponding memory DIMM could be
identified for replacement/isolation.

Since 'struct mce' isn't placed in a user-space visible file, a
duplicate copy of it has to be maintained in 'crash' (as is done in the
'mcelog' tool), and that's one disadvantage.

If you think that this complicates the patch, I'll start with a much
'slimmer' version (!) of the slimdump, and the improvements can be
contemplated iteratively.

Thanks,
K.Prasad