On Tue 22-10-19 11:58:52, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> > Hmm, that might be a misunderstanding on my end. I thought that it is
> > the MCE handler to say whether the failure is recoverable or not. If yes
> > then we can touch the content of the memory (that would imply the
> > migration). Other than that both paths should be essentially the same,
> > no? Well unrecoverable case would be essentially force migration failure
> > path.
> >
> > MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> > : This feature is intended for testing of memory error-handling
> > : code; it is available only if the kernel was configured with
> > : CONFIG_MEMORY_FAILURE.
> >
> > There is no explicit note about the type of the error that is injected
> > but I think it is reasonably safe to assume this is a recoverable one.
>
> MADV_HWPOISON stands for hard-offline.
> MADV_SOFT_OFFLINE stands for soft-offline.
>
> MADV_SOFT_OFFLINE (since Linux 2.6.33)
>         Soft offline the pages in the range specified by addr and
>         length. The memory of each page in the specified range is
>         preserved (i.e., when next accessed, the same content will be
>         visible, but in a new physical page frame), and the original
>         page is offlined (i.e., no longer used, and taken out of
>         normal memory management). The effect of the
>         MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
>         change the semantics of) the calling process.
>
>         This feature is intended for testing of memory error-handling
>         code; it is available only if the kernel was configured with
>         CONFIG_MEMORY_FAILURE.

I have missed that one somehow. Thanks for pointing out.

[...]

> AFAICS, for hard-offline case, a recovered event would be if:
>
> - the page to shut down is already free
> - the page was unmapped
>
> In some cases we need to kill the process if it holds dirty pages.

Yes, I would expect that the page table would be poisoned and the
process would receive a SIGBUS when accessing that memory.

> But we never migrate contents in hard-offline path.
> I guess it is because we cannot really trust the contents anymore.

Yes, that makes perfect sense. What I am saying is that the migration
(aka trying to recover) is the main and only difference. The soft
offline should poison page tables when not able to migrate as well
IIUC.
-- 
Michal Hocko
SUSE Labs