Hi Peter,

I have an alternative that is based on your patches. I like the overall
refactoring, but I was a little uncomfortable with having to add the
custom enum map_chg_state. The custom enum centralizes the possible
states, but the states still need to be interpreted again throughout the
function to take action (e.g. checks like if (map_chg != MAP_CHG_ENFORCED)
and if (map_chg == MAP_CHG_NEEDED)), and I feel that just shifts the
problem to understanding how the states are interpreted later on.

I switched back to something close to avoid_reserve, but improved on that
name using your comments, calling it bypass_vma_reservations. I feel that
calling it cow_from_owner lets an implementation detail of the CoW use
case bleed into this function. "bypass_vma_reservations" is named so that
it requests that this function, alloc_hugetlb_folio(), bypass the
resv_map. All parts of alloc_hugetlb_folio() that update the resv_map are
guarded by the bypass_vma_reservations flag.

This alternative proposes to use the following booleans, local to
alloc_hugetlb_folio() (a condensed sketch follows below):

1. vma_reservation_exists

   vma_reservation_exists represents whether a reservation exists in the
   resv_map and is used. It defaults to false, and when
   bypass_vma_reservations is not requested, the resv_map is consulted to
   see if a vma reservation exists. vma_reservation_exists also stores
   the result of the initial resv_map check, which is used to correctly
   fix up a race later on.

2. debit_subpool

   If vma_reservation_exists, then this allocation has already been
   reserved in both the resv_map and the subpool, so debiting the subpool
   is skipped. If alloc_hugetlb_folio() was requested to
   bypass_vma_reservations, the subpool should still be charged, so
   debit_subpool is set. If debit_subpool is set, we proceed to call
   hugepage_subpool_get_pages() and set up subpool_reservation_exists.
   Later on, debit_subpool selects the matching cleanup in the error
   case.

3. subpool_reservation_exists

   subpool_reservation_exists represents whether a reservation exists in
   the subpool and is used, analogous to vma_reservation_exists. The
   subpool is only checked if debit_subpool is set; if debit_subpool is
   not set, vma_reservation_exists determines whether
   subpool_reservation_exists. subpool_reservation_exists then guards
   decrementing h->resv_huge_pages, which fixes the bug you found.

4. charge_cgroup_rsvd

   This has the same condition as debit_subpool, but gets a separate
   variable for readability. Later on, charge_cgroup_rsvd determines
   whether to commit the charge, or whether to fix up the charge in
   error cases.

I also renamed dequeue_hugetlb_folio_vma() to
dequeue_hugetlb_folio_with_mpol(), aligning it with
alloc_buddy_hugetlb_folio_with_mpol(), so that gbl_chg no longer has to
be passed into the function; it is interpreted in alloc_hugetlb_folio()
instead. If subpool_reservation_exists, try to get a folio by dequeueing
it. If a subpool reservation does not exist, make sure there are
available pages before dequeueing. If dequeueing did not produce a folio
for whatever reason, allocate a new folio.

This could probably be a separate patch, but I'd like to hear your
thoughts before doing the integration/cleanup.
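To make the relationships among the four booleans concrete, here is a
stand-alone sketch of the derivations, condensed from the patch below.
This is for discussion only, not kernel code: the two _stub helpers are
hypothetical stand-ins for vma_needs_reservation() and
hugepage_subpool_get_pages(), and error paths, locking and the actual
allocation are all omitted.

	/* Stand-alone model of the boolean derivations. */
	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical stubs: return 0 if a reservation already exists. */
	static int vma_needs_reservation_stub(void) { return 1; }
	static int hugepage_subpool_get_pages_stub(void) { return 1; }

	int main(void)
	{
		bool bypass_vma_reservations = false;
		bool vma_reservation_exists = false;
		bool subpool_reservation_exists;
		bool debit_subpool;
		bool charge_cgroup_rsvd;

		/* 1. Only consult the resv_map when not bypassing it. */
		if (!bypass_vma_reservations)
			vma_reservation_exists =
				vma_needs_reservation_stub() == 0;

		/* 2. Debit the subpool unless a vma reservation covers us. */
		debit_subpool = !vma_reservation_exists ||
				bypass_vma_reservations;

		/* 3. Only check the subpool when we actually debit it. */
		if (debit_subpool)
			subpool_reservation_exists =
				hugepage_subpool_get_pages_stub() == 0;
		else
			/* A vma reservation implies a subpool reservation. */
			subpool_reservation_exists = vma_reservation_exists;

		/* 4. Same condition as debit_subpool, for readability. */
		charge_cgroup_rsvd = !vma_reservation_exists ||
				     bypass_vma_reservations;

		printf("debit_subpool=%d subpool_reservation_exists=%d "
		       "charge_cgroup_rsvd=%d\n", debit_subpool,
		       subpool_reservation_exists, charge_cgroup_rsvd);
		return 0;
	}

Note that with bypass_vma_reservations == true, both debit_subpool and
charge_cgroup_rsvd come out true regardless of the resv_map, which is
exactly what the CoW-from-owner path needs.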
These changes have been tested with your reproducer, and here's the test
output from the libhugetlbfs test cases:

********** TEST SUMMARY
*                      2M
*                      32-bit 64-bit
*     Total testcases:    82     85
*             Skipped:     9      9
*                PASS:    72     69
*                FAIL:     0      0
*    Killed by signal:     1      7
*   Bad configuration:     0      0
*       Expected FAIL:     0      0
*     Unexpected PASS:     0      0
*    Test not present:     0      0
* Strange test result:     0      0
**********

Ackerley

---
 mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------------
 1 file changed, 94 insertions(+), 92 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a0ea28f5bac..2cd588d35984 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
-					       struct vm_area_struct *vma,
-					       unsigned long address, long gbl_chg)
+static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
+						     struct vm_area_struct *vma,
+						     unsigned long address)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	int nid;
 
-	/*
-	 * gbl_chg==1 means the allocation requires a new page that was not
-	 * reserved before. Making sure there's at least one free page.
-	 */
-	if (gbl_chg && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 	mpol_cond_put(mpol);
 	return folio;
-
-err:
-	return NULL;
 }
 
 /*
@@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
 	return ret;
 }
 
-typedef enum {
-	/*
-	 * For either 0/1: we checked the per-vma resv map, and one resv
-	 * count either can be reused (0), or an extra needed (1).
-	 */
-	MAP_CHG_REUSE = 0,
-	MAP_CHG_NEEDED = 1,
-	/*
-	 * Cannot use per-vma resv count can be used, hence a new resv
-	 * count is enforced.
-	 *
-	 * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
-	 * that currently vma_needs_reservation() has an unwanted side
-	 * effect to either use end() or commit() to complete the
-	 * transaction. Hence it needs to differenciate from NEEDED.
-	 */
-	MAP_CHG_ENFORCED = 2,
-} map_chg_state;
-
 /*
- * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
- * faults of hugetlb private mappings on top of a non-page-cache folio (in
- * which case even if there's a private vma resv map it won't cover such
- * allocation). New call sites should (probably) never set it to true!!
- * When it's set, the allocation will bypass all vma level reservations.
+ * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
+ * faults of hugetlb private mappings need to allocate a new page on top of a
+ * non-page-cache folio. In this situation, even if there's a private vma resv
+ * map, the resv map must be bypassed. New call sites should (probably) never
+ * set bypass_vma_reservations to true!!
  */
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
-				  unsigned long addr, bool cow_from_owner)
+				  unsigned long addr, bool bypass_vma_reservations)
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long retval, gbl_chg;
-	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	bool vma_reservation_exists = false;
+	bool subpool_reservation_exists;
+	bool debit_subpool;
+	bool charge_cgroup_rsvd;
 
 	idx = hstate_index(h);
 
-	/* Whether we need a separate per-vma reservation? */
-	if (cow_from_owner) {
-		/*
-		 * Special case! Since it's a CoW on top of a reserved
-		 * page, the private resv map doesn't count. So it cannot
-		 * consume the per-vma resv map even if it's reserved.
-		 */
-		map_chg = MAP_CHG_ENFORCED;
-	} else {
+	if (!bypass_vma_reservations) {
 		/*
 		 * Examine the region/reserve map to determine if the process
-		 * has a reservation for the page to be allocated. A return
-		 * code of zero indicates a reservation exists (no change).
-		 */
-		retval = vma_needs_reservation(h, vma, addr);
-		if (retval < 0)
+		 * has a reservation for the page to be allocated and debit the
+		 * reservation. If npages_req == 0, a reservation exists and is
+		 * used. If npages_req > 0, a reservation has to be taken either
+		 * from the subpool or global pool.
+		 */
+		int npages_req = vma_needs_reservation(h, vma, addr);
+		if (npages_req < 0)
 			return ERR_PTR(-ENOMEM);
-		map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+
+		vma_reservation_exists = npages_req == 0;
 	}
 
 	/*
-	 * Whether we need a separate global reservation?
+	 * If no vma reservation exists, debit the subpool.
 	 *
+	 * Even if we were requested to bypass_vma_reservations, debit the
+	 * subpool - the subpool still has to be charged for this allocation.
+	 */
+	debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
+
+	/*
 	 * Processes that did not create the mapping will have no
 	 * reserves as indicated by the region/reserve map. Check
 	 * that the allocation will not exceed the subpool limit.
 	 * Or if it can get one from the pool reservation directly.
 	 */
-	if (map_chg) {
-		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0)
+	if (debit_subpool) {
+		int npages_req = hugepage_subpool_get_pages(spool, 1);
+		if (npages_req < 0)
 			goto out_end_reservation;
-	} else {
+
 		/*
-		 * If we have the vma reservation ready, no need for extra
-		 * global reservation.
-		 */
-		gbl_chg = 0;
+		 * npages_req == 0 indicates a reservation exists for the
+		 * allocation in the subpool and can be used. npages_req > 0
+		 * indicates that a reservation must be taken from the global
+		 * pool.
+		 */
+		subpool_reservation_exists = npages_req == 0;
+	} else {
+		/* A vma reservation implies having a subpool reservation. */
+		subpool_reservation_exists = vma_reservation_exists;
 	}
 
 	/*
-	 * If this allocation is not consuming a per-vma reservation,
-	 * charge the hugetlb cgroup now.
+	 * If no vma reservation exists, charge the cgroup's reserved quota.
+	 *
+	 * Even if we were requested to bypass_vma_reservations, the cgroup
+	 * still has to be charged for this allocation.
 	 */
-	if (map_chg) {
+	charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
+	if (charge_cgroup_rsvd) {
 		ret = hugetlb_cgroup_charge_cgroup_rsvd(
 			idx, pages_per_huge_page(h), &h_cg);
 		if (ret)
@@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		goto out_uncharge_cgroup_reservation;
 
 	spin_lock_irq(&hugetlb_lock);
-	/*
-	 * glb_chg is passed to indicate whether or not a page must be taken
-	 * from the global free pool (global change). gbl_chg == 0 indicates
-	 * a reservation exists for the allocation.
-	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+
+	if (subpool_reservation_exists) {
+		folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+	} else {
+		/*
+		 * Since no subpool_reservation_exists, the allocation requires
+		 * a new page that was not reserved before. Only dequeue if
+		 * there are available pages.
+		 */
+		if (available_huge_pages(h)) {
+			folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+		} else {
+			folio = NULL;
+			/* Fallthrough to allocate a new page. */
+		}
+	}
+
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
@@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Either dequeued or buddy-allocated folio needs to add special
-	 * mark to the folio when it consumes a global reservation.
+	 * If subpool_reservation_exists (and is used for this allocation),
+	 * decrement resv_huge_pages to indicate that a reservation was used.
 	 */
-	if (!gbl_chg) {
+	if (subpool_reservation_exists) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
 
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (map_chg) {
+
+	if (charge_cgroup_rsvd) {
 		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
 						  h_cg, folio);
 	}
@@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	hugetlb_set_folio_subpool(folio, spool);
 
-	if (map_chg != MAP_CHG_ENFORCED) {
-		/* commit() is only needed if the map_chg is not enforced */
-		retval = vma_commit_reservation(h, vma, addr);
+	if (!bypass_vma_reservations) {
+		/*
+		 * As long as vma reservations were not bypassed, we need to
+		 * commit() to clear up any adds_in_progress in resv_map.
+		 */
+		int ret = vma_commit_reservation(h, vma, addr);
 		/*
-		 * Check for possible race conditions. When it happens..
-		 * The page was added to the reservation map between
-		 * vma_needs_reservation and vma_commit_reservation.
-		 * This indicates a race with hugetlb_reserve_pages.
+		 * If there is a discrepancy in reservation status between the
+		 * time of vma_needs_reservation() and vma_commit_reservation(),
+		 * then the page must have been added to the reservation
+		 * map between vma_needs_reservation() and
+		 * vma_commit_reservation().
+		 *
 		 * Adjust for the subpool count incremented above AND
 		 * in hugetlb_reserve_pages for the same page. Also,
 		 * the reservation count added in hugetlb_reserve_pages
 		 * no longer applies.
 		 */
-		if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+		if (unlikely(!vma_reservation_exists && ret == 0)) {
 			long rsv_adjust;
 
 			rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 			hugetlb_acct_memory(h, -rsv_adjust);
-			if (map_chg) {
+			if (charge_cgroup_rsvd) {
 				spin_lock_irq(&hugetlb_lock);
 				hugetlb_cgroup_uncharge_folio_rsvd(
 					hstate_index(h), pages_per_huge_page(h),
@@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_uncharge_cgroup_reservation:
-	if (map_chg)
+	if (charge_cgroup_rsvd)
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg)
+	if (debit_subpool)
 		hugepage_subpool_put_pages(spool, 1);
 out_end_reservation:
-	if (map_chg != MAP_CHG_ENFORCED)
+	if (!bypass_vma_reservations)
 		vma_end_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }
@@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	 * be acquired again before returning to the caller, as expected.
 	 */
 	spin_unlock(vmf->ptl);
+
+	/*
+	 * If this is a CoW from the owner of this page, we
+	 * bypass_vma_reservations, since the reservation was already consumed
+	 * when the hugetlb folio was first allocated before the fork happened.
+	 */
 	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
 
 	if (IS_ERR(new_folio)) {
-- 
2.47.1.688.g23fc6f90ad-goog