Hi,

Investigating another issue, I wrote the following simple program that allocates and faults in 500 1GB huge pages, and then registers them with io_uring. Each step is timed:

Got 500 huge pages (each 1024MB) in 0 msec
Faulted in 500 huge pages in 38632 msec
Registered 500 pages in 867 msec

and as expected, faulting in the pages takes (by far) the longest. From the above, you'd also expect the total runtime to be around ~39 seconds. But it is not... In fact, it takes 82 seconds in total for this program to exit. Looking at why, I see:

[<0>] __wait_rcu_gp+0x12b/0x160
[<0>] synchronize_rcu_normal.part.0+0x2a/0x30
[<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
[<0>] update_and_free_pages_bulk+0x4c/0x220
[<0>] return_unused_surplus_pages+0x80/0xa0
[<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
[<0>] hugetlb_vm_op_close+0x160/0x180
[<0>] remove_vma+0x20/0x60
[<0>] exit_mmap+0x199/0x340
[<0>] mmput+0x49/0x110
[<0>] do_exit+0x261/0x9b0
[<0>] do_group_exit+0x2c/0x80
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] x64_sys_call+0x714/0x720
[<0>] do_syscall_64+0x5b/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

and yes, the program does look idle for most of that time while returning these huge pages. The trace also tells us exactly why we're sitting idle - an RCU grace period. With the quick change below, the runtime of the program is pretty much just the sum of the timed steps, as you can see from the full output after the change:

axboe@r7525 ~> time sudo ./reg-huge
Got 500 huge pages (each 1024MB) in 0 msec
Faulted in 500 huge pages in 38632 msec
Registered 500 pages in 867 msec

________________________________________________________
Executed in   39.53 secs    fish           external
   usr time    4.88 millis  238.00 micros    4.64 millis
   sys time    0.00 millis    0.00 micros    0.00 millis

where 38632+867 == 39.50s.
Looks like this was introduced by:

commit bd225530a4c717714722c3731442b78954c765b3
Author: Yu Zhao <yuzhao@xxxxxxxxxx>
Date:   Thu Jun 27 16:27:05 2024 -0600

    mm/hugetlb_vmemmap: fix race with speculative PFN walkers

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0c3f56b3578e..95f6ad8f8232 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -517,7 +517,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	long ret = 0;
 
 	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {

-- 
Jens Axboe