On Mon 21-01-19 12:58:46, Qian Cai wrote:
>
>
> On 1/21/19 11:38 AM, Qian Cai wrote:
> >
> >
> > On 1/21/19 4:53 AM, Michal Hocko wrote:
> >> On Thu 17-01-19 21:16:50, Qian Cai wrote:
> >>> On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
> >>> with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is
> >>> not directly mapped (MEMBLOCK_NOMAP). Hence, the page->flags is
> >>> uninitialized.
> >>>
> >>> This is due to the commit 9f1eb38e0e11 ("mm, kmemleak: little
> >>> optimization while scanning") starts to use pfn_to_online_page() instead
> >>> of pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case,
> >>> pfn_to_online_page() does not call memblock_is_map_memory() while
> >>> pfn_valid() does.
> >>
> >> How come there is an online section which has an pfn_valid==F? We do
> >> allocate the full section worth of struct pages so there is a valid
> >> struct page. Is there any hole inside this section?
> >
> > It has CONFIG_HOLES_IN_ZONE=y.
>
> Actually, this does not seem have anything to do with holes.
>
> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap being no long reassigned to the new zone in
> memmap_init_zone() by calling __init_single_page().

Thanks for the pointer. This sheds some light, but I cannot say I
understand all the details.

> There is an old discussion for this topic.
> https://lkml.org/lkml/2016/11/30/566

Hmm, I see. The documentation is not the best (mea culpa)
 * Return page for the valid pfn only if the page is online. All pfn
 * walkers which rely on the fully initialized page->flags and others
 * should use this rather than pfn_valid && pfn_to_page

This suggests that the pfn is already known to be _valid_ when
pfn_to_online_page is called, and some callers indeed check that first.
Some of them don't, though, which is probably because the latter part of
the documentation suggests that it should replace pfn_valid &&
pfn_to_page.

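To make the two calling conventions concrete, here is a minimal userspace
sketch. It is not kernel code: every model_* name is hypothetical, and the
two helpers only mimic the behaviour described in this thread (the
pfn_valid stand-in rejects NOMAP pfns, the pfn_to_online_page stand-in
checks only that the section is online).

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PFNS 8UL

/* Hypothetical stand-in for struct page: only records whether flags were set up. */
struct model_page {
	bool flags_initialized;
};

/* Assumption: pfns 2 and 3 are NOMAP, so their struct pages were never initialized. */
static const bool model_nomap[MODEL_NR_PFNS] = {
	false, false, true, true, false, false, false, false
};

static struct model_page model_memmap[MODEL_NR_PFNS] = {
	{ true }, { true }, { false }, { false },
	{ true }, { true }, { true }, { true }
};

/* Assumption: the whole modelled section is online. */
static const bool model_section_online = true;

/* Mimics a pfn_valid() that also rejects NOMAP memory. */
static bool model_pfn_valid(unsigned long pfn)
{
	return pfn < MODEL_NR_PFNS && !model_nomap[pfn];
}

/* Mimics a pfn_to_online_page() that only looks at the section state. */
static struct model_page *model_pfn_to_online_page(unsigned long pfn)
{
	if (pfn >= MODEL_NR_PFNS || !model_section_online)
		return NULL;
	return &model_memmap[pfn];
}

/*
 * Walk all pfns and count how many pages with uninitialized flags the
 * walker would touch. With check_valid set, the walker calls the
 * pfn_valid stand-in before looking up the page.
 */
static int model_scan(bool check_valid)
{
	int uninitialized = 0;

	for (unsigned long pfn = 0; pfn < MODEL_NR_PFNS; pfn++) {
		struct model_page *page;

		if (check_valid && !model_pfn_valid(pfn))
			continue;

		page = model_pfn_to_online_page(pfn);
		if (!page)
			continue;

		if (!page->flags_initialized)
			uninitialized++;
	}

	return uninitialized;
}
```

In this model, model_scan(true) touches no uninitialized pages, while
model_scan(false) touches the two NOMAP ones, which is the kind of access
that trips CONFIG_DEBUG_VM_PGFLAGS in the report above.
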
Thinking about this more, I guess we do not want to put an additional
burden on callers and require pfn_valid to be called as well. This is
just error prone and can lead to problems like this one. So I agree with
your change (modulo the range check), but please make sure all of this
information makes it into the changelog.

Thanks!
-- 
Michal Hocko
SUSE Labs