> > The plan is to pass-down the list of poisoned memory pages to the second
> > kernel using an elf-note so that these pages are left untouched during
> > dump capture. I'm working on an implementation of the same and should
> > have patches soon.
>
> I would say let us first figure out what happens while reading a poisoned
> page and is this a problem before working on a solution.

If the page is poisoned because of a real uncorrectable error in memory
(reported as an SRAO machine check today, or as SRAR real-soon-now), then
accessing the page from the processor while taking a memory dump will
result in a machine check.

Note that a large memory system that has been running for a long time may
have built up a small stash of these land-mine pages, and we need to worry
about them even when the panic is not machine check related. In fact,
especially then: that is the case where we actually do want the dump to
diagnose the cause of the panic, and we don't want to risk losing the crash
dump because we aborted when touching a page that the OS had safely avoided
for days/weeks/months.

So passing a list of poisoned pages from the old kernel to the new kernel
is a good idea, and it is independent of the cause of the crash (except
that in the fatal machine check case due to a memory error, the list is
guaranteed to be non-empty).

Passing some crash signature data, so the new kernel/dump-tools can choose
whether to even try to take a full dump, is also interesting (but
independent of the bad page list).

-Tony
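
To make the bad-page-list idea concrete, here is a rough sketch of how a dump loop in the capture kernel (or a dump tool) might skip poisoned pages. Everything in it is an assumption for illustration: `struct poison_list`, `pfn_is_poisoned()`, `dump_range()` and `count_page()` are hypothetical names, the list is assumed to arrive already decoded from the ELF note as a sorted array of page frame numbers, and no note layout has been agreed on.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: poisoned page frame numbers handed over by the old kernel,
 * assumed here to be already decoded and sorted in ascending order. */
struct poison_list {
    const uint64_t *pfns;
    size_t count;
};

/* Binary search over the sorted list; true if pfn must not be touched. */
static bool pfn_is_poisoned(const struct poison_list *pl, uint64_t pfn)
{
    size_t lo = 0, hi = pl->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pl->pfns[mid] == pfn)
            return true;
        if (pl->pfns[mid] < pfn)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}

/* Callback that would read and save one page in a real dump tool. */
typedef void (*copy_fn)(uint64_t pfn, void *ctx);

/* Stand-in callback for this sketch: only counts the pages it is given. */
static void count_page(uint64_t pfn, void *ctx)
{
    (void)pfn;
    ++*(size_t *)ctx;
}

/* Walk [start, end), never touching a poisoned page.
 * Returns the number of pages skipped. */
static size_t dump_range(const struct poison_list *pl,
                         uint64_t start, uint64_t end,
                         copy_fn copy, void *ctx)
{
    size_t skipped = 0;

    for (uint64_t pfn = start; pfn < end; pfn++) {
        if (pfn_is_poisoned(pl, pfn)) {
            skipped++;          /* leave the land-mine page alone */
            continue;
        }
        copy(pfn, ctx);
    }
    return skipped;
}
```

The point of the sketch is only that the check happens before any access to the page, so the dump never triggers a machine check on a page the old kernel had already quarantined. A real implementation would also want to record the skipped pages in the dump so the analysis tools know the holes are intentional.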