Michael Ellerman <mpe@xxxxxxxxxxxxxx> writes:
> Vlastimil Babka <vbabka@xxxxxxx> writes:
>> On 3/18/20 11:02 AM, Michal Hocko wrote:
>>> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
>>>> Calling a kmalloc_node on a possible node which is not yet onlined can
>>>> lead to panic. Currently node_present_pages() doesn't verify the node is
>>>> online before accessing the pgdat for the node. However pgdat struct may
>>>> not be available resulting in a crash.
>>>>
>>>> NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
>>>> LR [c0000000003d5b94] __slab_alloc+0x34/0x60
>>>> Call Trace:
>>>> [c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
>>>> [c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
>>>> [c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
>>>> [c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
>>>> [c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
>>>> [c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
>>>> [c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
>>>> [c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
>>>> [c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
>>>> [c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
>>>> [c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
>>>> [c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
>>>>
>>>> Fix this by verifying the node is online before accessing the pgdat
>>>> structure. Fix the same for node_spanned_pages() too.
>>>>
>>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>>> Cc: linux-mm@xxxxxxxxx
>>>> Cc: Mel Gorman <mgorman@xxxxxxx>
>>>> Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
>>>> Cc: Sachin Sant <sachinp@xxxxxxxxxxxxxxxxxx>
>>>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>>>> Cc: Christopher Lameter <cl@xxxxxxxxx>
>>>> Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
>>>> Cc: Kirill Tkhai <ktkhai@xxxxxxxxxxxxx>
>>>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>>>> Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
>>>> Cc: Bharata B Rao <bharata@xxxxxxxxxxxxx>
>>>> Cc: Nathan Lynch <nathanl@xxxxxxxxxxxxx>
>>>>
>>>> Reported-by: Sachin Sant <sachinp@xxxxxxxxxxxxxxxxxx>
>>>> Tested-by: Sachin Sant <sachinp@xxxxxxxxxxxxxxxxxx>
>>>> Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
>>>> ---
>>>>  include/linux/mmzone.h | 6 ++++--
>>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index f3f264826423..88078a3b95e5 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>>>>    atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
>>>>  } pg_data_t;
>>>>
>>>> -#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
>>>> -#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
>>>> +#define node_present_pages(nid) \
>>>> +  (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
>>>> +#define node_spanned_pages(nid) \
>>>> +  (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>>>
>>> I believe this is a wrong approach. We really do not want to special
>>> case all the places which require NODE_DATA. Can we please go and
>>> allocate pgdat for all possible nodes?
>>>
>>> The current state of memoryless hacks and subtle bugs popping up here and
>>> there just proves that we should have done that from the very beginning
>>> IMHO.
>>
>> Yes.
>> So here's an alternative proposal for fixing the current situation in SLUB,
>> before the long-term solution of having all possible nodes provide valid pgdat
>> with zonelists:
>>
>> - fix SLUB with the hunk at the end of this mail - the point is to use NUMA_NO_NODE
>>   as fallback instead of node_to_mem_node()
>> - this removes all uses of node_to_mem_node (luckily it's just SLUB),
>>   kill it completely instead of trying to fix it up
>> - patch 1/4 is not needed with the fix
>> - perhaps many of your other patches are also not needed
>> - once we get the long-term solution, some of the !node_online() checks can be removed
>
> Seems like a nice solution to me :)
>
>> ----8<----
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..1d4f2d7a0080 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
>>      struct page *page;
>>      unsigned int order = oo_order(oo);
>>
>> -    if (node == NUMA_NO_NODE)
>> +    if (node == NUMA_NO_NODE || !node_online(node))
>
> Why don't we need the node_present_pages() check here?
>
>>              page = alloc_pages(flags, order);
>>      else
>>              page = __alloc_pages_node(node, flags, order);
>> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>>
>>      if (node == NUMA_NO_NODE)
>>              searchnode = numa_mem_id();
>> -    else if (!node_present_pages(node))
>> -            searchnode = node_to_mem_node(node);
>>
>>      object = get_partial_node(s, get_node(s, searchnode), c, flags);
>>      if (object || node != NUMA_NO_NODE)
>> @@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  redo:
>>
>>      if (unlikely(!node_match(page, node))) {
>> -            int searchnode = node;
>> -
>> -            if (node != NUMA_NO_NODE && !node_present_pages(node))
>> -                    searchnode = node_to_mem_node(node);
>> -
>> -            if (unlikely(!node_match(page, searchnode))) {
>> +            /*
>> +             * node_match() false implies node != NUMA_NO_NODE
>> +             * but if the node is not online and has no pages, just
>                                                ^
>                                                this should be 'or' ?

Sorry I see you've already fixed this in the version you posted.

cheers
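
For anyone following the thread later, here is a rough userspace model of the two ideas discussed above: the guarded accessor from the mmzone.h hunk, and the NUMA_NO_NODE fallback from the alloc_slab_page() hunk. This is only a sketch and not kernel code. Everything prefixed with model_, the struct pgdat_model type, the four-node layout and the page count are made up for illustration; only the names NUMA_NO_NODE and MAX_NUMNODES are borrowed from the kernel, and their values here are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NUMNODES 4      /* assumed small value, for the model only */
#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for pg_data_t; only the field we need. */
struct pgdat_model {
	unsigned long node_present_pages;
};

/* Assumed layout: node 0 is online, nodes 1..3 are possible but have no pgdat. */
static struct pgdat_model model_node0 = { .node_present_pages = 1024 };
static struct pgdat_model *model_node_data[MAX_NUMNODES] = {
	&model_node0, NULL, NULL, NULL
};
static bool model_online_map[MAX_NUMNODES] = { true, false, false, false };

/* Models node_online(): is there a usable pgdat behind this node id? */
static bool model_node_online(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return model_online_map[nid];
}

/*
 * Models the guarded macro from the mmzone.h hunk: the pgdat pointer is
 * only dereferenced when the node is online, otherwise 0 is reported.
 */
static unsigned long model_node_present_pages(int nid)
{
	return model_node_online(nid) ?
		model_node_data[nid]->node_present_pages : 0;
}

/*
 * Models the condition in the alloc_slab_page() hunk: a request for an
 * offline node is treated like "no node preference".
 */
static int model_alloc_node(int node)
{
	if (node == NUMA_NO_NODE || !model_node_online(node))
		return NUMA_NO_NODE;
	return node;
}
```

In this model, asking for node 2 never touches the NULL pgdat pointer in either helper, which is the crash both hunks are trying to avoid. The second helper shows why the SLUB-side change does not need the accessor to be special-cased: the offline node is filtered out before anything looks at its pgdat.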