* Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> [220530 13:38]:
> * Guenter Roeck <linux@xxxxxxxxxxxx> [220519 17:42]:
> > On 5/19/22 07:35, Liam Howlett wrote:
> > > * Guenter Roeck <linux@xxxxxxxxxxxx> [220517 10:32]:
> > >
> > > ...
> > >
> > > > Another bisect result, boot failures with nommu targets (arm:mps2-an385,
> > > > m68k:mcf5208evb). Bisect log is the same for both.
> > > ...
> > > > # first bad commit: [bd773a78705fb58eeadd80e5b31739df4c83c559] nommu: remove uses of VMA linked list
> > >
> > > I cannot reproduce this on my side, even with that specific commit.  Can
> > > you point me to the failure log, config file, etc?  Do you still see
> > > this with the fixes I've sent recently?
> > >
> >
> > This was in linux-next; most recently with next-20220517.
> > I don't know if that was up-to-date with your patches.
> > The problem seems to be memory allocation failures.
> > A sample log is at
> > https://kerneltests.org/builders/qemu-m68k-next/builds/1065/steps/qemubuildcommand/logs/stdio
> > The log history at
> > https://kerneltests.org/builders/qemu-m68k-next?numbuilds=30
> > will give you a variety of logs.
> >
> > The configuration is derived from m5208evb_defconfig, with initrd
> > and command line embedded in the image. You can see the detailed
> > configuration updates at
> > https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/run-qemu-m68k.sh
> >
> > Qemu command line is
> >
> > qemu-system-m68k -M mcf5208evb -kernel vmlinux \
> >     -cpu m5208 -no-reboot -nographic -monitor none \
> >     -append "rdinit=/sbin/init console=ttyS0,115200"
> >
> > with initrd from
> > https://github.com/groeck/linux-build-test/blob/master/rootfs/m68k/rootfs-5208.cpio.gz
> >
> > I use qemu v6.2, but any recent qemu version should work.
>
> I have qemu 7.0, which seems to change the default memory size from 32MB
> to 128MB.
> This can be seen in your log here:
>
> Memory: 27928K/32768K available (2827K kernel code, 160K rwdata, 432K rodata, 1016K init, 66K bss, 4840K reserved, 0K cma-reserved)
>
> With 128MB the kernel boots.  With 64MB it also boots.  32MB fails with
> an OOM.  Looking into it more, I see that the OOM is caused by a
> contiguous page allocation of 1MB (order 7 at 8K pages).  This can be
> seen in the log as well:
>
> Running sysctl: echo: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL), nodemask=(null)
> ...
> nommu: Allocation of length 884736 from process 63 (echo) failed
>
> This last log message comes from the code path that uses
> alloc_pages_exact().
>
> I don't see why my 256-byte nodes (an order-0 allocation yields 32
> nodes) would fragment the memory beyond use on boot.  I have checked for
> some sort of massive leak by adding a static node count to the code and
> have only ever hit ~12 nodes.  Consulting the OOM log from the above
> link again:
>
> DMA: 0*8kB 1*16kB (U) 9*32kB (U) 7*64kB (U) 21*128kB (U) 7*256kB (U) 6*512kB (U) 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 8304kB
>
> So to get to the point of breaking up a 1MB block, we'd need an obscene
> number of nodes.
>
> Furthermore, the OOM on boot is not always happening.  When boot
> succeeded without an OOM, I checked slabinfo and saw that maple_node had
> 32 active objects, which is one order-0 allocation.  Most boots do still
> hit the OOM, though.  It is worth noting that the slabinfo count of
> active objects is lazily updated, so the true count is most likely lower
> than this value.
>
> Does anyone have any idea why nommu would be getting this fragmented?

Answer: Why, yes.  Matthew does.

Using alloc_pages_exact() means we allocate the huge chunk of memory and
then free the leftovers immediately.  Those freed leftover pages are
handed out on the next request - which happens to be the maple tree.  It
seems nommu is so close to OOMing already that this makes a difference.
Attached is a patch which _almost_ solves the issue by making it less
likely to use those pages, but it is still a matter of timing whether
this will OOM anyway.  It reduces the failure rate by a large margin, to
maybe 1/10 failing instead of 4/5.  This patch is probably worth taking
on its own as it reduces memory fragmentation for short-lived
allocations that use alloc_pages_exact().

I changed the nommu code a bit to reduce memory usage as well.  During a
split event, I no longer delete then re-add the VMA, and I only
preallocate a single time for the two writes associated with a split.  I
also moved my preallocation ahead of the call path that does
alloc_pages_exact().  This all but ensures we won't fragment the larger
chunks of memory, as we get enough nodes out of a single page to run at
least through boot.  However, the failure rate remained at 1/10 with
this change.

I had accepted the scenario that this all just worked before, but my
setup is different from Guenter's.  I am using buildroot-2022.02.1 and
qemu 7.0 for my testing.  My configuration OOMs 12/13 times without the
maple tree, so I think we actually lowered the memory pressure on boot
with these changes.  Obviously there is an element of timing that causes
variation in the testing, so exact numbers are not possible.

Thanks,
Liam
From abef6d264d2413a625670bdb873133576d5cce5f Mon Sep 17 00:00:00 2001
From: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
Date: Tue, 31 May 2022 09:20:51 -0400
Subject: [PATCH] mm/page_alloc: Reduce potential fragmentation in
 make_alloc_exact()

Try to avoid using the left over split page on the next request for a
page by calling __free_pages_ok() with FPI_TO_TAIL.  This increases the
potential of defragmenting memory when it's used for a short period of
time.

Suggested-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
---
 mm/page_alloc.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f01c71e41bcf..8b6d6cada684 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5580,14 +5580,18 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
 {
 	if (addr) {
-		unsigned long alloc_end = addr + (PAGE_SIZE << order);
-		unsigned long used = addr + PAGE_ALIGN(size);
-
-		split_page(virt_to_page((void *)addr), order);
-		while (used < alloc_end) {
-			free_page(used);
-			used += PAGE_SIZE;
-		}
+		unsigned long nr = DIV_ROUND_UP(size, PAGE_SIZE);
+		struct page *page = virt_to_page((void *)addr);
+		struct page *last = page + nr;
+
+		split_page_owner(page, 1 << order);
+		split_page_memcg(page, 1 << order);
+		while (page < --last)
+			set_page_refcounted(last);
+
+		last = page + (1UL << order);
+		for (page += nr; page < last; page++)
+			__free_pages_ok(page, 0, FPI_TO_TAIL);
 	}
 	return (void *)addr;
 }
-- 
2.35.1