On Tue, Aug 18, 2020 at 02:24:46PM +0200, Michal Hocko wrote: > On Tue 18-08-20 13:00:46, Oscar Salvador wrote: > > Recently a customer of ours experienced a crash when booting the > > system while enabling memory-hotplug. > > > > The problem is that Normal zones on different nodes don't get their private > > zone->pageset allocated, and keep sharing the initial boot_pageset. > > The sharing between zones is normally safe as explained by the comment for > > boot_pageset - it's a percpu structure, and manipulations are done with > > disabled interrupts, and boot_pageset is set up in a way that any page placed > > on its pcplist is immediately flushed to shared zone's freelist, because > > pcp->high == 1. > > However, the hotplug operation updates pcp->high to a higher value as it > > expects to be operating on a private pageset. > > > > The problem is in build_all_zonelists(), which is called when the first range > > of pages is onlined for the Normal zone of node X or Y: > > > > if (system_state == SYSTEM_BOOTING) { > > build_all_zonelists_init(); > > } else { > > #ifdef CONFIG_MEMORY_HOTPLUG > > if (zone) > > setup_zone_pageset(zone); > > #endif > > /* we have to stop all cpus to guarantee there is no user > > of zonelist */ > > stop_machine(__build_all_zonelists, pgdat, NULL); > > /* cpuset refresh routine should be here */ > > } > > > > When called during hotplug, it should execute the setup_zone_pageset(zone) > > which allocates the private pageset. > > However, with memhp_default_state=online, this happens early while > > system_state == SYSTEM_BOOTING is still true, hence this step is skipped. > > (and build_all_zonelists_init() is probably unsafe anyway at this point). > > > > Another hotplug operation on the same zone then leads to zone_pcp_update(zone) > > called from online_pages(), which updates the pcp->high for the shared > > boot_pageset to a value higher than 1. > > At that point, pages freed from Node X and Y Normal zones can end up on the same > > pcplist and from there they can be freed to the wrong zone's freelist, > > leading to the corruption and crashes. > > > > Please, note that upstream has fixed that differently (and unintentionally) by > > adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init(). > > That should happen before memory hotplug events even with memhp_default_state=online. > > Backporting that would be too intrusive. > > > > Signed-off-by: Oscar Salvador <osalvador@xxxxxxx> > > Debugged-by: Vlastimil Babka <vbabka@xxxxxxx> > > Yes, I believe this is the easiest and the least scary way to fix the > issue for stable kernel users. Feel free to add > Acked-by: Michal Hocko <mhocko@xxxxxxxx> # for stable trees > > for that purpose. Now queued up, thanks! greg k-h