On 26.02.22 10:40, Miaohe Lin wrote:
> There is a theoretical race window between memory failure and memory
> offline. Think about the below scene:
>
>   CPU A                                 CPU B
> memory_failure                          offline_pages
>   mutex_lock(&mf_mutex);
>   TestSetPageHWPoison(p)
>                                           start_isolate_page_range
>                                             has_unmovable_pages
>                                               --PageHWPoison is movable
>                                           do {
>                                             scan_movable_pages
>                                             do_migrate_range
>                                               --PageHWPoison isn't migrated
>                                           }
>                                           test_pages_isolated
>                                             --PageHWPoison is isolated
>                                         remove_memory
>   access page... bang
>   ...

I think the motivation for the offlining code was to not block memory
hotunplug (especially on ZONE_MOVABLE) just because there is a
HWpoisoned page. But how often does that happen?

It's all semi-broken either way. Assume you just offlined a memory block
with a hwpoisoned page. The memmap is stale and the information about
hwpoison is lost. You can happily re-online that memory block and use
*all* memory, including previously hwpoisoned memory. Note that this
used to be different in the past, when the memmap was initialized when
adding memory, not when onlining that memory.

IMHO, we should stop special casing hwpoison. Either fail offlining
completely if we stumble over a hwpoisoned page, or allow offlining only
if the refcount==0 -- just as any other page.

-- 
Thanks,

David / dhildenb