On Fri, Jul 26, 2024 at 8:38 PM Baoquan He <bhe@xxxxxxxxxx> wrote:
>
> On 07/26/24 at 05:29pm, Barry Song wrote:
> > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@xxxxxxxx> wrote:
> > >
> > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > [...]
> > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > allocation fails, it will go normal mapping.
> > > > > >
> > > > > > However I also save the fallback position. The ones before this position are
> > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > "support the combination of PMD and PTE mapping". this will take some
> > > > > > times as it needs to address the corner cases and do some tests.
> > > > >
> > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > and vmalloc_huge().
> > > > >
> > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > it's a best effort behaviour.
> > > > >
> > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > > because the align requirement of huge. the real size is 4M.
> > > > if allocation first order-9 successfully and the next failed. because the
> > > > fallback, the layout out pages would be like order9 - 512 * order0
> > > > order9 support huge mapping, but order0 not.
> > > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > > mapping, the huge mapping would not exist.
> > > >
> > > > >     mm/mm_init.c <<alloc_large_system_hash>>
> > > > >     table = vmalloc_huge(size, gfp_flags);
> > > > >     net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > >     new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > >     net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > >     udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > >
> > > > > Maybe we should add code comment or document to notice people that the
> > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > use it after boot.
> > > > >
> > > > > >
> > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > other to address the current mapping issue.
> > > > > >
> > > > > > --
> > > > > > help you, help me,
> > > > > > Hailong.
> > > > > >
> > > > > >
> > > >
> > > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > > fallback to order 0, actually without this commit
> > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > >
> > > fail:
> > >         if (shift > PAGE_SHIFT) {
> > >                 shift = PAGE_SHIFT;
> > >                 align = real_align;
> > >                 size = real_size;
> > >                 goto again;
> > >         }
> > >
> > > So do we really need fallback to order-0 if nofail?
> >
> > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > And just remove __GFP_NOFAIL bit when we are still allocating
> > by high-order. When "goto again" happens, we will allocate by
> > order-0, in this case, we keep the __GFP_NOFAIL.
>
> With Michal's patch, the fallback will be able to satisfy the allocation
> for nofail case because it fallback to 0-order plus __GFP_NOFAIL.
> The

My point is that vm_area_alloc_pages() is an internal function and just
an implementation detail. As long as its caller, __vmalloc_area_node(),
can support NOFAIL, it's fine. Therefore, we can skip NOFAIL support for
high-order allocations in vm_area_alloc_pages() and limit __GFP_NOFAIL
support to order-0.

The good news is that __vmalloc_node_range_noprof() already has a way to
fall back to order-0:

fail:
        if (shift > PAGE_SHIFT) {
                shift = PAGE_SHIFT;
                align = real_align;
                size = real_size;
                goto again;
        }

So we can use this fallback instead of implementing one within
vm_area_alloc_pages(), which would alter the page_order of the vm_area
and create an inconsistency, crashing the system due to memory
corruption. With the higher-level fallback in
__vmalloc_node_range_noprof(), we won't need an unusual fix-up for the
vm_area as you're proposing. The page_order of the vm_area will
consistently stay the same.

If there is any room for improvement, we may also add:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..03d8148d7a02 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3806,7 +3806,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
                warn_alloc(gfp_mask, NULL,
                        "vmalloc error: size %lu, vm_struct allocation failed%s",
                        real_size, (nofail) ? ". Retrying." : "");
-               if (nofail) {
+               if (nofail && shift == PAGE_SHIFT) {
                        schedule_timeout_uninterruptible(1);
                        goto again;
                }

There's no need to keep retrying __get_vm_area_node() at the huge shift
until it succeeds, as we will succeed once we fall back to order-0.

> 'if (shift > PAGE_SHIFT)' conditional checking and handling may be
> problematic since it could jump to fail because vmap_pages_range()
> invocation failed, or partially allocate huge pages and break down,
> then it will ignore the already allocated pages, and do all the thing again.
>
> The only thing 'if (shift > PAGE_SHIFT)' checking and handling makes
> sense is it fallback to the real_size and real_align.
> BUT we need handle
> the fail separately, e.g
> 1)__get_vm_area_node() failed;
> 2)vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> 3)vmap_pages_range() failed;
>
> Honestly, I didn't see where the nofail is mishandled, could you point
> it out specifically? I could miss it.
>
> Thanks
> Baoquan
>

Thanks
Barry
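
P.S. To make the proposed flow concrete, here is a small userspace model
of it. This is only a sketch and not kernel code: model_alloc_pages(),
model_vmalloc() and struct model_result are hypothetical names invented
for illustration, the PAGE_SHIFT and PMD_SHIFT values assume 4K pages
with 2M huge mappings, and the allocator is assumed to always fail
high-order requests. The point it shows is that nofail is only honoured
at order-0, and that the caller's existing "goto again" path is what
guarantees success.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12	/* assumption: 4K base pages */
#define PMD_SHIFT  21	/* assumption: 2M huge mappings */

/*
 * Hypothetical stand-in for vm_area_alloc_pages(). In this model every
 * high-order request fails (as if memory were fragmented) and every
 * order-0 request succeeds.
 */
static bool model_alloc_pages(unsigned int shift, bool nofail)
{
	/* Invariant of the proposal: nofail is only passed at order-0. */
	assert(!nofail || shift == PAGE_SHIFT);

	return shift == PAGE_SHIFT;
}

struct model_result {
	bool ok;		/* did the allocation succeed? */
	unsigned int shift;	/* mapping granularity finally used */
	int attempts;		/* passes through the "again" label */
};

/* Hypothetical stand-in for the retry logic in the vmalloc caller. */
static struct model_result model_vmalloc(unsigned long size, bool nofail,
					 bool allow_huge)
{
	struct model_result r = { false, PAGE_SHIFT, 0 };
	unsigned int shift = PAGE_SHIFT;

	if (allow_huge && size >= (1UL << PMD_SHIFT))
		shift = PMD_SHIFT;

again:
	r.attempts++;
	/* Drop nofail while still trying the huge shift. */
	if (model_alloc_pages(shift, nofail && shift == PAGE_SHIFT)) {
		r.ok = true;
		r.shift = shift;
		return r;
	}

	/* Same shape as the existing fail: path, retry once with order-0. */
	if (shift > PAGE_SHIFT) {
		shift = PAGE_SHIFT;
		goto again;
	}

	return r;
}
```

Calling model_vmalloc(4UL << 20, true, true) in this model takes two
passes: the huge attempt fails without nofail, and the order-0 retry
succeeds with it, so the whole area ends up with one consistent shift.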