On Mon, 5 Apr 2021 13:50:18 +0000 HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@xxxxxxx> wrote:

> On Fri, Apr 02, 2021 at 03:11:20PM +0000, Luck, Tony wrote:
> > >> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> > >> early, but first process is still looking for mappings to unmap and tasks
> > >> to signal) this patch moves forward a bit. But I think it needs an
> > >> additional change here in kill_me_maybe() to just "return" if there is an
> > >> EHWPOISON return from memory_failure()
> > >>
> > > Got this, thanks for your reply!
> > > I will dig into this!
> >
> > One problem with this approach is when the first task to find poison
> > fails to complete actions. Then the poison pages are not unmapped,
> > and just returning from kill_me_maybe() gets into a loop :-(
>
> Yes, that's the pain point. We need to send SIGBUS to the current process in
> the "already hardware poisoned" case of memory_failure(). SIGBUS should
> contain the error virtual address, but unfortunately walking the page table
> or using p->mce_vaddr is not always reliable now.
>
> So as a second-best approach, we can extend the "walking page table"
> approach such that we walk over the whole virtual address space to make sure
> that the number of entries pointing to the error page is exactly 1.
> If that's the case, then we can confidently send SIGBUS with it. If we find
> multiple entries pointing to the error page, then we give up guessing and
> send a normal SIGBUS to the current process. That's not worse than now,
> and I think we need to wait in the hope that the virtual address will be
> available in the MCE handler.
>
> Anyway I'll try to write a patch for this.

Yeah, the previous patch didn't address the multiple virtual address issue.
If there is a way to fix that, that would be great!

-- 
Thanks!
Aili Yao
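
For illustration only, the "exactly one entry" check proposed above could be sketched in user-space C roughly as follows. This is not kernel code and not the patch under discussion: struct mapping and find_unique_vaddr() are hypothetical names, and a flat array stands in for the real page-table walk over the whole address space.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one page-table entry: a virtual address
 * and the page frame number (pfn) it maps to. */
struct mapping {
	uintptr_t vaddr;
	unsigned long pfn;
};

/*
 * Scan every mapping of the task. If exactly one entry points at the
 * poisoned pfn, store its virtual address in *out and return 1.
 * If there are zero or several such entries, return 0 and leave *out
 * untouched, so the caller falls back to a SIGBUS without an address.
 */
static int find_unique_vaddr(const struct mapping *maps, size_t n,
			     unsigned long poisoned_pfn, uintptr_t *out)
{
	size_t hits = 0;
	uintptr_t found = 0;

	for (size_t i = 0; i < n; i++) {
		if (maps[i].pfn == poisoned_pfn) {
			found = maps[i].vaddr;
			hits++;
		}
	}

	if (hits != 1)
		return 0;	/* ambiguous or not mapped: give up guessing */

	*out = found;
	return 1;
}
```

A return of 1 corresponds to the "confidently send SIGBUS with it" case, and a return of 0 to the fallback of sending a normal SIGBUS. The sketch keeps scanning after the first hit on purpose, since stopping early could not tell a unique mapping from an aliased one.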