Re: Hugepage program taking forever to exit

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 10 Sep 2024 15:33:42 -0400

On Tue, Sep 10, 2024 at 12:21:42PM -0600, Jens Axboe wrote:
> Hi,
> 
> Investigating another issue, I wrote the following simple program that allocates
> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
> step is timed:
> 
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
> 
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
> 
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Yeah, this looks wrong to me:

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	list_for_each_entry(folio, folio_list, lru) {
		int ret = hugetlb_vmemmap_split_folio(h, folio);

		/*
		 * Splitting the PMD requires allocating a page, thus let's fail
		 * early once we encounter the first OOM. No point in retrying
		 * as it can be dynamically done on remap with the memory
		 * we get back from the vmemmap deduplication.
		 */
		if (ret == -ENOMEM)
			break;
	}

	flush_tlb_all();

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	list_for_each_entry(folio, folio_list, lru) {
		int ret;

		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
						       VMEMMAP_REMAP_NO_TLB_FLUSH);

		/*
		 * Pages to be freed may have been accumulated. If we
		 * encounter an ENOMEM, free what we have and try again.
		 * This can occur in the case that both splitting fails
		 * halfway and head page allocation also failed. In this
		 * case __hugetlb_vmemmap_optimize_folio() would free memory
		 * allowing more vmemmap remaps to occur.
		 */
		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
			flush_tlb_all();
			free_vmemmap_page_list(&vmemmap_pages);
			INIT_LIST_HEAD(&vmemmap_pages);
			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
							 VMEMMAP_REMAP_NO_TLB_FLUSH);
		}
	}

	flush_tlb_all();
	free_vmemmap_page_list(&vmemmap_pages);
}

If you don't have HVO enabled, then hugetlb_vmemmap_split_folio() does
nothing. And __hugetlb_vmemmap_optimize_folio() also does nothing,
leaving &vmemmap_pages empty and free_vmemmap_page_list() a nop.

So what's left is: it flushes the TLB twice and waits for RCU. What
for exactly?

The same is true for hugetlb_vmemmap_optimize_folio() and the
corresponding split function, which wait for RCU on every page being
allocated and freed, even if the vmemmap is left alone.

Surely all those RCU waits and TLB flushes should be guarded by
whether HVO is actually enabled, no?
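
To make the suggestion concrete, here is a rough userspace model of the
guard, not a kernel patch. Everything in it is made up for illustration:
hvo_enabled stands in for whatever check tells us HVO is active for the
hstate, and the model_*() helpers only count calls instead of flushing
TLBs or waiting for a grace period. The split loop from the function
quoted above is left out to keep it short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters standing in for the cost of the real primitives. */
static unsigned int nr_tlb_flushes;
static unsigned int nr_rcu_waits;
static unsigned int nr_folios_remapped;

/* Hypothetical switch: "is HVO enabled for this hstate?" */
static bool hvo_enabled;

/* Stand-ins for flush_tlb_all() and synchronize_rcu(); they only count. */
static void model_flush_tlb_all(void)   { nr_tlb_flushes++; }
static void model_synchronize_rcu(void) { nr_rcu_waits++; }

/* Per-folio remap work: does nothing when HVO is off. */
static void model_optimize_one(void)
{
	if (!hvo_enabled)
		return;
	nr_folios_remapped++;
}

/* Shape of the quoted code: flushes and the RCU wait run unconditionally. */
static void optimize_folios_unguarded(size_t nr_folios)
{
	size_t i;

	model_flush_tlb_all();
	model_synchronize_rcu();

	for (i = 0; i < nr_folios; i++)
		model_optimize_one();

	model_flush_tlb_all();
}

/* Suggested shape: return before any flush or RCU wait if HVO is off. */
static void optimize_folios_guarded(size_t nr_folios)
{
	if (!hvo_enabled)
		return;

	optimize_folios_unguarded(nr_folios);
}

/* Reset the counters and pick the HVO setting for one experiment. */
static void model_reset(bool enabled)
{
	hvo_enabled = enabled;
	nr_tlb_flushes = 0;
	nr_rcu_waits = 0;
	nr_folios_remapped = 0;
}
```

With hvo_enabled false, the unguarded variant still pays for two TLB
flushes and one RCU wait per call while remapping nothing; the guarded
variant pays for none. With hvo_enabled true, both behave identically.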