On 12.02.20 08:31, Baoquan He wrote:
> On 02/11/20 at 04:41pm, Andrew Morton wrote:
>> On Tue, 11 Feb 2020 07:07:41 +0800 Wei Yang <richardw.yang@xxxxxxxxxxxxxxx> wrote:
>>
>>> On Mon, Feb 10, 2020 at 02:15:51PM +0800, Baoquan He wrote:
>>>> On 02/10/20 at 02:09pm, Baoquan He wrote:
>>>>> On 02/09/20 at 09:56pm, Andrew Morton wrote:
>>>>>> On Mon, 10 Feb 2020 13:40:27 +0800 Baoquan He <bhe@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> On 02/09/20 at 09:32pm, Andrew Morton wrote:
>>>>>>>> On Tue, 04 Feb 2020 11:25:48 +0000 bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
>>>>>>>>
>>>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>>>>>>>>>
>>>>>>>>
>>>>>>>> An oops during mem hotadd. Could someone please take a look when
>>>>>>>> convenient?
>>>>>>>
>>>>>>> This has been addressed by Wei Yang's patch, please check it here:
>>>>>>>
>>>>>>> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@xxxxxxxxxx
>>>>>>>
>>>>>>
>>>>>> hm, OK, thanks. It's unfortunate that a 5.5 fix is buried in a
>>>>>> six-patch series which is still in progress! Can we please merge that
>>>>>> as a standalone fix with a cc:stable, Fixes:, etc?
>>>>
>>>> Maybe can add Fixes tag as follow when merge:
>>>>
>>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>
>>
>> The reporter (cc'ed here) is still seeing issues:
>> https://bugzilla.kernel.org/show_bug.cgi?id=206401
>>
>> Could we please continue this investigation via emailed reply-to-all,
>> rather than via the bugzilla interface?
>
> Yes, people prefer mailing list to discuss issues.
>
> Hi T.Kabe,
>
> Could you provide the call trace again after below patch is applied?
> The comment #9 in bugzilla is not very clear to me.
>
> mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
> http://lkml.kernel.org/r/20200209104826.3385-7-bhe@xxxxxxxxxx
>
> And, as you said, applying above patch, and do not call
> __free_pages_core() in generic_online_page() will work.
> I doubt it, because without __free_pages_core(), your added pages are not
> added into buddy for managing.

Removing __free_pages_core() from generic_online_page() is just plain wrong
and would break memory hotplug in general. So that is certainly not the
right fix.

HV supports memory sections that are fully added, but only parts of them
are actually backed in the hypervisor, "online" and exposed to the buddy.
When onlining memory, it will online the backed parts via
hv_online_page()->generic_online_page(). When requested to hot-add more
memory, the guest will online the remaining parts that are now backed via
handle_pg_range()->hv_bring_pgs_online().

So if generic_online_page() fails, it's either because

1. HV guest driver has a bug and tries to online something it shouldn't
2. HV hypervisor has a bug and does not back memory properly before
   hot-adding
3. Memory hotplug code has a bug and does not properly add the memory
   block/sections

Please note that hv_balloon switched to using generic_online_page() in

commit 30a9c246b9f6fe0591e8afb05758a3e3b096fabe
Author: David Hildenbrand <david@xxxxxxxxxx>
Date:   Sat Nov 30 17:53:55 2019 -0800

    hv_balloon: use generic_online_page()

    Let's use the generic onlining function - which will now also take care of
    calling kernel_map_pages().

However, the old code ended up calling

__online_page_free() -> __free_reserved_page() -> __free_page()

And the new one ends up calling

__free_pages_core() -> __free_pages()

So I don't think it's related to that.

Especially, looking at the kernel messages, I can see that the kernel
crashes when adding memory, not when onlining it. So I do think there is
still something wrong in the SPARSE hot-add code if you keep seeing issues.

-- 
Thanks,

David / dhildenb
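
The partial-onlining behaviour described above (a section that is fully added, of which only the hypervisor-backed ranges are onlined, with the rest onlined later) can be sketched as a toy userspace model. This is only an illustration under stated assumptions: every name below (Section, hot_add, online_range, PAGES_PER_SECTION) is hypothetical and none of it is kernel or hv_balloon code. The three error paths loosely mirror the three possible causes listed in the mail.

```python
# Toy userspace model of "add the whole section, online only what is backed".
# All names are hypothetical illustrations, not kernel APIs.
from dataclasses import dataclass, field

PAGES_PER_SECTION = 8  # tiny illustrative value, not the real section size


@dataclass
class Section:
    added: bool = False                        # set once the section is hot-added
    backed: set = field(default_factory=set)   # pages the hypervisor backs
    online: set = field(default_factory=set)   # pages handed to the allocator


def hot_add(section):
    """Model hot-adding a whole section; no page is online yet."""
    section.added = True


def online_range(section, start, count):
    """Online pages [start, start + count), rejecting the three bug cases."""
    if not section.added:
        # loosely: hotplug code did not properly add the section
        raise RuntimeError("section was never added")
    pages = set(range(start, start + count))
    if not pages <= set(range(PAGES_PER_SECTION)):
        # loosely: guest driver tries to online something it shouldn't
        raise ValueError("range lies outside the section")
    if not pages <= section.backed:
        # loosely: hypervisor did not back the memory before onlining
        raise RuntimeError("range is not backed")
    section.online |= pages


sec = Section()
hot_add(sec)                      # section fully added, nothing online
sec.backed |= set(range(0, 4))    # hypervisor backs the first half
online_range(sec, 0, 4)           # initial onlining of the backed part
sec.backed |= set(range(4, 8))    # later request: remaining pages get backed
online_range(sec, 4, 4)           # guest onlines the newly backed part
print(sorted(sec.online))
```

In this model, skipping the final step of online_range() (the analogue of dropping __free_pages_core()) would leave every page out of the allocator, which is why that change cannot be a fix.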