Hi,

Investigating another issue, I wrote the following simple program that allocates and faults in 500 1GB huge pages, and then registers them with io_uring. Each step is timed:

Got 500 huge pages (each 1024MB) in 0 msec
Faulted in 500 huge pages in 38632 msec
Registered 500 pages in 867 msec

and as expected, faulting in the pages takes (by far) the longest. From the above, you'd also expect the total runtime to be around ~39 seconds. But it is not... In fact, it takes 82 seconds in total for this program to exit. Looking at why, I see:

[<0>] __wait_rcu_gp+0x12b/0x160
[<0>] synchronize_rcu_normal.part.0+0x2a/0x30
[<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
[<0>] update_and_free_pages_bulk+0x4c/0x220
[<0>] return_unused_surplus_pages+0x80/0xa0
[<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
[<0>] hugetlb_vm_op_close+0x160/0x180
[<0>] remove_vma+0x20/0x60
[<0>] exit_mmap+0x199/0x340
[<0>] mmput+0x49/0x110
[<0>] do_exit+0x261/0x9b0
[<0>] do_group_exit+0x2c/0x80
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] x64_sys_call+0x714/0x720
[<0>] do_syscall_64+0x5b/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

and yes, the program does look idle for most of that time while returning these huge pages. The trace also tells us exactly why we're sitting idle - an RCU grace period. With the quick change below, the runtime of the program is pretty much just the sum of the timed steps, as you can see from the full output after the change:

axboe@r7525 ~> time sudo ./reg-huge
Got 500 huge pages (each 1024MB) in 0 msec
Faulted in 500 huge pages in 38632 msec
Registered 500 pages in 867 msec

________________________________________________________
Executed in   39.53 secs    fish           external
   usr time    4.88 millis  238.00 micros    4.64 millis
   sys time    0.00 millis    0.00 micros    0.00 millis

where 38632+867 == 39.50s.
Looks like this was introduced by:

commit bd225530a4c717714722c3731442b78954c765b3
Author: Yu Zhao <yuzhao@xxxxxxxxxx>
Date:   Thu Jun 27 16:27:05 2024 -0600

    mm/hugetlb_vmemmap: fix race with speculative PFN walkers

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0c3f56b3578e..95f6ad8f8232 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -517,7 +517,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	long ret = 0;
 
 	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {

-- 
Jens Axboe