On 18.08.20 13:00, Oscar Salvador wrote: > Recently a customer of ours experienced a crash when booting the > system while enabling memory-hotplug. > > The problem is that Normal zones on different nodes don't get their private > zone->pageset allocated, and keep sharing the initial boot_pageset. > The sharing between zones is normally safe as explained by the comment for > boot_pageset - it's a percpu structure, and manipulations are done with > disabled interrupts, and boot_pageset is set up in a way that any page placed > on its pcplist is immediately flushed to shared zone's freelist, because > pcp->high == 1. > However, the hotplug operation updates pcp->high to a higher value as it > expects to be operating on a private pageset. > > The problem is in build_all_zonelists(), which is called when the first range > of pages is onlined for the Normal zone of node X or Y: > > if (system_state == SYSTEM_BOOTING) { > build_all_zonelists_init(); > } else { > #ifdef CONFIG_MEMORY_HOTPLUG > if (zone) > setup_zone_pageset(zone); > #endif > /* we have to stop all cpus to guarantee there is no user > of zonelist */ > stop_machine(__build_all_zonelists, pgdat, NULL); > /* cpuset refresh routine should be here */ > } > > When called during hotplug, it should execute the setup_zone_pageset(zone) > which allocates the private pageset. > However, with memhp_default_state=online, this happens early while > system_state == SYSTEM_BOOTING is still true, hence this step is skipped. > (and build_all_zonelists_init() is probably unsafe anyway at this point). > > Another hotplug operation on the same zone then leads to zone_pcp_update(zone) > called from online_pages(), which updates the pcp->high for the shared > boot_pageset to a value higher than 1. > At that point, pages freed from Node X and Y Normal zones can end up on the same > pcplist and from there they can be freed to the wrong zone's freelist, > leading to the corruption and crashes. > > Please, note that upstream has fixed that differently (and unintentionally) by > adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init(). > That should happen before memory hotplug events even with memhp_default_state=online. > Backporting that would be too intrusive. > > Signed-off-by: Oscar Salvador <osalvador@xxxxxxx> > Debugged-by: Vlastimil Babka <vbabka@xxxxxxx> So, we have ACPI running and already adding DIMMs while booting? Crazy. Looks sane to me. Thanks! -- Thanks, David / dhildenb