Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

Hi Matthew,

Afraid I have another bug for you...

On 27/02/2024 17:42, Matthew Wilcox (Oracle) wrote:
> Hugetlb folios still get special treatment, but normal large folios
> can now be freed by free_unref_folios().  This should have a reasonable
> performance impact, TBD.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>

When running some swap tests with this change (now in mm-stable) present, I see BadThings(TM). Usually it's a "bad page state" report, followed by a delay of a few seconds, then an oops or NULL pointer dereference. Bisect points to this change, and if I revert it, the problem goes away.

Here is one example, running against mm-unstable (a7f399ae964e). Two reports hit the console concurrently and their lines were interleaved, so I have de-interleaved them below by timestamp. First, the "bad page state" report from usemem:

[   76.239466] BUG: Bad page state in process usemem  pfn:2554a0
[   76.240724]  dump_backtrace+0x98/0xf8
[   76.241943]  show_stack+0x20/0x38
[   76.242680]  dump_stack_lvl+0x48/0x60
[   76.243278]  dump_stack+0x18/0x28
[   76.244510]  bad_page+0x88/0x128
[   76.245370]  free_page_is_bad_report+0xa4/0xb8
[   76.246572]  __free_pages_ok+0x370/0x4b0
[   76.247489]  destroy_large_folio+0x94/0x108
[   76.248451]  __folio_put_large+0x70/0xc0
[   76.249256]  __folio_put+0xac/0xc0
[   76.249260]  deferred_split_scan+0x234/0x340
[   76.249997]  do_shrink_slab+0x144/0x460
[   76.250829]  shrink_slab+0x2e0/0x4e0
[   76.251604]  shrink_node+0x204/0x8a0
[   76.252147]  do_try_to_free_pages+0xd0/0x568
[   76.252881]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[   76.253687]  try_charge_memcg+0x12c/0x650
[   76.254583]  __mem_cgroup_charge+0x6c/0xd0
[   76.255181]  __handle_mm_fault+0xe90/0x16a8
[   76.255977]  handle_mm_fault+0x70/0x2b0
[   76.256756]  do_page_fault+0x100/0x4c0
[   76.257540]  do_translation_fault+0xb4/0xd0
[   76.258095]  do_mem_abort+0x4c/0xa8
[   76.258883]  el0_da+0x2c/0x78
[   76.259616]  el0t_64_sync_handler+0xe4/0x158
[   76.260286]  el0t_64_sync+0x190/0x198

Then the concurrent oops from kcompactd:

[   76.240196] kernel BUG at include/linux/mm.h:1120!
[   76.240198] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[   76.241523] Modules linked in:
[   76.242855] CPU: 2 PID: 62 Comm: kcompactd0 Not tainted 6.8.0-rc5-00456-ga7f399ae964e #16
[   76.244138] Hardware name: linux,dummy-virt (DT)
[   76.244995] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   76.246101] pc : migrate_folio_done+0x140/0x150
[   76.247048] lr : migrate_folio_done+0x140/0x150
[   76.247971] sp : ffff800083f5b8d0
[   76.248807] x29: ffff800083f5b8d0 x28: 0000000000000000 x27: ffff800083f5bb30
[   76.251979] x26: 0000000000000001 x25: 0000000000000010 x24: fffffc0008552800
[   76.255013] x23: ffff0000e6f353a8 x22: ffff0013f5fa59c0 x21: 0000000000000000
[   76.257932] x20: 0000000000000007 x19: fffffc0008552800 x18: 0000000000000010
[   76.260729] x17: 3030303030303020 x16: 6666666666666666 x15: 3030303030303030
[   76.262010] x14: 0000000000000000 x13: 7465732029732867 x12: 616c662045455246
[   76.262746] x11: 5f54415f4b434548 x10: ffff800082e8bff8 x9 : ffff8000801276ac
[   76.263462] x8 : 00000000ffffefff x7 : ffff800082e8bff8 x6 : 0000000000000000
[   76.264182] x5 : ffff0013f5eb9d08 x4 : 0000000000000000 x3 : 0000000000000000
[   76.264903] x2 : 0000000000000000 x1 : ffff0000c105d640 x0 : 000000000000003e
[   76.265604] Call trace:
[   76.265865]  migrate_folio_done+0x140/0x150
[   76.266278]  migrate_pages_batch+0x9ec/0xff0
[   76.266716]  migrate_pages+0xd20/0xe20
[   76.267103]  compact_zone+0x7b4/0x1000
[   76.267460]  kcompactd_do_work+0x174/0x4d8
[   76.267869]  kcompactd+0x26c/0x418
[   76.268175]  kthread+0x120/0x130
[   76.268517]  ret_from_fork+0x10/0x20
[   76.268892] Code: aa1303e0 b000d161 9100c021 97fe0465 (d4210000) 
[   76.269447] ---[ end trace 0000000000000000 ]---
[   76.269893] note: kcompactd0[62] exited with irqs disabled
[   76.270483] note: kcompactd0[62] exited with preempt_count 1

And finally the page dump leading into the second oops:

[   76.269942] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0xffffbd0a0 pfn:0x2554a0
[   76.271344] head: order:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
[   76.272521] flags: 0xbfffc0000080058(uptodate|dirty|head|swapbacked|node=0|zone=2|lastcpupid=0xffff)
[   76.273265] page_type: 0xffffffff()
[   76.273542] raw: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000
[   76.274368] raw: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000
[   76.275043] head: 0bfffc0000080058 dead000000000100 dead000000000122 0000000000000000
[   76.275651] head: 0000000ffffbd0a0 0000000000000000 00000000ffffffff 0000000000000000
[   76.276407] head: 0bfffc0000000000 0000000000000000 fffffc0008552848 0000000000000000
[   76.277064] head: 0000001000000000 0000000000000000 00000000ffffffff 0000000000000000
[   76.277784] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[   76.278502] ------------[ cut here ]------------
[   76.278893] kernel BUG at include/linux/mm.h:1120!
[   76.279269] Internal error: Oops - BUG: 00000000f2000800 [#2] PREEMPT SMP
[   76.280144] Modules linked in:
[   76.280401] CPU: 6 PID: 1337 Comm: usemem Tainted: G    B D            6.8.0-rc5-00456-ga7f399ae964e #16
[   76.281214] Hardware name: linux,dummy-virt (DT)
[   76.281635] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   76.282256] pc : deferred_split_scan+0x2f0/0x340
[   76.282698] lr : deferred_split_scan+0x2f0/0x340
[   76.283082] sp : ffff80008681b830
[   76.283426] x29: ffff80008681b830 x28: ffff0000cd4fb3c0 x27: fffffc0008552800
[   76.284113] x26: 0000000000000001 x25: 00000000ffffffff x24: 0000000000000001
[   76.284914] x23: 0000000000000000 x22: fffffc0008552800 x21: ffff0000e9df7820
[   76.285590] x20: ffff80008681b898 x19: ffff0000e9df7818 x18: 0000000000000000
[   76.286271] x17: 0000000000000001 x16: 0000000000000001 x15: ffff0000c0617210
[   76.286927] x14: ffff0000c10b6558 x13: 0000000000000040 x12: 0000000000000228
[   76.287543] x11: 0000000000000040 x10: 0000000000000a90 x9 : ffff800080220ed8
[   76.288176] x8 : ffff0000cd4fbeb0 x7 : 0000000000000000 x6 : 0000000000000000
[   76.288842] x5 : ffff0013f5f35d08 x4 : 0000000000000000 x3 : 0000000000000000
[   76.289538] x2 : 0000000000000000 x1 : ffff0000cd4fb3c0 x0 : 000000000000003e
[   76.290201] Call trace:
[   76.290432]  deferred_split_scan+0x2f0/0x340
[   76.290856]  do_shrink_slab+0x144/0x460
[   76.291221]  shrink_slab+0x2e0/0x4e0
[   76.291513]  shrink_node+0x204/0x8a0
[   76.291831]  do_try_to_free_pages+0xd0/0x568
[   76.292192]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[   76.292599]  try_charge_memcg+0x12c/0x650
[   76.292926]  __mem_cgroup_charge+0x6c/0xd0
[   76.293289]  __handle_mm_fault+0xe90/0x16a8
[   76.293713]  handle_mm_fault+0x70/0x2b0
[   76.294031]  do_page_fault+0x100/0x4c0
[   76.294343]  do_translation_fault+0xb4/0xd0
[   76.294694]  do_mem_abort+0x4c/0xa8
[   76.294968]  el0_da+0x2c/0x78
[   76.295202]  el0t_64_sync_handler+0xe4/0x158
[   76.295565]  el0t_64_sync+0x190/0x198
[   76.295860] Code: aa1603e0 d000d0e1 9100c021 97fdc715 (d4210000) 
[   76.296429] ---[ end trace 0000000000000000 ]---
[   76.296805] note: usemem[1337] exited with irqs disabled
[   76.297261] note: usemem[1337] exited with preempt_count 1



My test case is intended to stress swap:

  - Running in a VM (on Ampere Altra) with 70 vCPUs and 80G RAM
  - A 35G block ram device (CONFIG_BLK_DEV_RAM & "brd.rd_nr=1 brd.rd_size=36700160")
  - The ramdisk is configured as the swap backend
  - The test runs in a memcg constrained to 40G (to force memory pressure)
  - The test case has 70 processes, each allocating and writing 1G of RAM


swapoff -a
mkswap /dev/ram0
swapon -f /dev/ram0
cgcreate -g memory:/mmperfcgroup
echo 40G > /sys/fs/cgroup/mmperfcgroup/memory.max
cgexec -g memory:mmperfcgroup sudo -u $(whoami) bash

Then inside that second bash shell, run this script:

--8<---
function run_usemem_once {
        ./usemem -n 70 -O 1G | grep -v "free memory"
}

function run_usemem_multi {
        size=${1}
        for i in {1..2}; do
                echo "${size} THP ${i}"
                run_usemem_once
        done
}

for f in /sys/kernel/mm/transparent_hugepage/hugepages-*/enabled; do
        echo never > "$f"
done
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
run_usemem_multi "64K"
--8<---

It will usually get through the first iteration of the loop in run_usemem_multi() and fail on the second. I've never seen it get all the way through both iterations.

"usemem" is from the vm-scalability suite. It just allocates and writes loads of anonymous memory (70 is concurrent processes, 1G is the amount of memory per process). Then the memory pressure from the cgroup causes lots of swap to happen.

> ---
>  mm/swap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index dce5ea67ae05..6b697d33fa5b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1003,12 +1003,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>  		if (!folio_ref_sub_and_test(folio, nr_refs))
>  			continue;
>  
> -		if (folio_test_large(folio)) {
> +		/* hugetlb has its own memcg */
> +		if (folio_test_hugetlb(folio)) {

This still looks reasonable to me after re-review, so I have no idea what the problem is. I recall seeing some weird crashes when I looked at the original RFC, but I didn't have time to debug them at the time; I wonder if the root cause is the same.

If you find a smoking gun, I'm happy to test it if the above is too painful to reproduce.

Thanks,
Ryan

>  			if (lruvec) {
>  				unlock_page_lruvec_irqrestore(lruvec, flags);
>  				lruvec = NULL;
>  			}
> -			__folio_put_large(folio);
> +			free_huge_folio(folio);
>  			continue;
>  		}
>  




