On Fri, Jan 21, 2022 at 10:59:14AM -0800, Mike Kravetz wrote: > On 1/20/22 16:09, Mike Kravetz wrote: > > On 11/15/21 00:40, Naoya Horiguchi wrote: > >> From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > >> > >> Originally mf_mutex is introduced to serialize multiple MCE events, but > >> it is not that useful to allow unpoison to run in parallel with memory_failure() > >> and soft offline. So apply mf_mutex to soft offline and unpoison. > >> The memory failure handler and soft offline handler get simpler with this. > >> > > Sorry for the late question. > > It is not directly part of this change, but can we also remove the check in > __soft_offline_page after this comment? > > /* > * Check PageHWPoison again inside page lock because PageHWPoison > * is set by memory_failure() outside page lock. Note that > * memory_failure() also double-checks PageHWPoison inside page lock, > * so there's no race between soft_offline_page() and memory_failure(). > */ You're right. This comment is obsolete. I'll send a patch later. And you pointed out some important points in the previous email, thank you. > There are only two places other places where PageHWPoison is modified without > the mutex. They are: > - test_and_clear_pmem_poison > I 'think' pmem error handling is done separately so this does not apply. Yes, soft-offline nor unpoison should work for pmem error, so the relevant race should not be problematic. > - clear_hwpoisoned_pages > Called before removing memory (and deleting memmap) to reconcile count > of poisoned pages. Should not be an issue and technically I do not think > the ClearPageHWPoison() is actually needed in this routine. Right, this ClearPageHWPoison() might be unnecessary because struct pages are freed soon after. But this function also updates num_poisoned_pages, and the race could break the counter. I'll check the effect of the race with hotremove and how mf_mutex can solve it. Thanks, Naoya Horiguchi