On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >
> > On 10/02/23 17:18, Nhat Pham wrote:
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index de220e3ff8be..74472e911b0a 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > >                                       pages_per_huge_page(h), folio);
> > >          hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >                                             pages_per_huge_page(h), folio);
> > > +        mem_cgroup_uncharge(folio);
> > >          if (restore_reserve)
> > >                  h->resv_huge_pages++;
> > >
> > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >          struct hugepage_subpool *spool = subpool_vma(vma);
> > >          struct hstate *h = hstate_vma(vma);
> > >          struct folio *folio;
> > > -        long map_chg, map_commit;
> > > +        long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > >          long gbl_chg;
> > > -        int ret, idx;
> > > +        int memcg_charge_ret, ret, idx;
> > >          struct hugetlb_cgroup *h_cg = NULL;
> > > +        struct mem_cgroup *memcg;
> > >          bool deferred_reserve;
> > > +        gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > +
> > > +        memcg = get_mem_cgroup_from_current();
> > > +        memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > +        if (memcg_charge_ret == -ENOMEM) {
> > > +                mem_cgroup_put(memcg);
> > > +                return ERR_PTR(-ENOMEM);
> > > +        }
> > >
> > >          idx = hstate_index(h);
> > >          /*
> > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >           * code of zero indicates a reservation exists (no change).
> > >           */
> > >          map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > -        if (map_chg < 0)
> > > +        if (map_chg < 0) {
> > > +                if (!memcg_charge_ret)
> > > +                        mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +                mem_cgroup_put(memcg);
> > >                  return ERR_PTR(-ENOMEM);
> > > +        }
> > >
> > >          /*
> > >           * Processes that did not create the mapping will have no
> > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >           */
> > >          if (map_chg || avoid_reserve) {
> > >                  gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > -                if (gbl_chg < 0) {
> > > -                        vma_end_reservation(h, vma, addr);
> > > -                        return ERR_PTR(-ENOSPC);
> > > -                }
> > > +                if (gbl_chg < 0)
> > > +                        goto out_end_reservation;
> > >
> > >                  /*
> > >                   * Even though there was no reservation in the region/reserve
> > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >                  hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >                                                     pages_per_huge_page(h), folio);
> > >          }
> > > +
> > > +        if (!memcg_charge_ret)
> > > +                mem_cgroup_commit_charge(folio, memcg);
> > > +        mem_cgroup_put(memcg);
> > > +
> > >          return folio;
> > >
> > >  out_uncharge_cgroup:
> > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  out_subpool_put:
> > >          if (map_chg || avoid_reserve)
> > >                  hugepage_subpool_put_pages(spool, 1);
> > > +out_end_reservation:
> > >          vma_end_reservation(h, vma, addr);
> > > +        if (!memcg_charge_ret)
> > > +                mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +        mem_cgroup_put(memcg);
> > >          return ERR_PTR(-ENOSPC);
> > >  }
> > >
> >
> > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > free_huge_folio. During migration, huge pages are allocated via
> > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > charging for the migration target page and we uncharge the source page.
> > It looks like there will be no charge for the huge page after migration?
> >
>
> Ah I see! This is a bit subtle indeed.
>
> For the hugetlb controller, it looks like they update the cgroup info
> inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> to transfer the hugetlb cgroup info to the destination folio.
>
> Perhaps we can do something analogous here.
>
> > If my analysis above is correct, then we may need to be careful about
> > this accounting. We may not want both source and target pages to be
> > charged at the same time.
>
> We can create a variant of mem_cgroup_migrate that does not double
> charge, but instead just copies the mem_cgroup information to the new
> folio and then clears that info from the old folio. That way the memory
> usage counters are untouched.
>
> Somebody with more expertise on migration should fact-check me,
> of course :)

The only reason mem_cgroup_migrate() double charges right now is that
it's used by replace_page_cache_folio(). In that context, the isolation
of the old page isn't quite as thorough as with migration, so it cannot
transfer and uncharge directly. This goes back a long time:
0a31bc97c80c3fa87b32c091d9a930ac19cd0c40

If you rename the current implementation to mem_cgroup_replace_page()
for that one caller, you can add a mem_cgroup_migrate() variant which is
charge-neutral and clears old->memcg_data. This can be used for regular
and hugetlb page migration.

Something like this (totally untested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..17ec45bf3653 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
         if (mem_cgroup_disabled())
                 return;
 
-        /* Page cache replacement: new folio already charged? */
-        if (folio_memcg(new))
-                return;
-
         memcg = folio_memcg(old);
         VM_WARN_ON_ONCE_FOLIO(!memcg, old);
         if (!memcg)
                 return;
 
-        /* Force-charge the new page. The old one will be freed soon */
-        if (!mem_cgroup_is_root(memcg)) {
-                page_counter_charge(&memcg->memory, nr_pages);
-                if (do_memsw_account())
-                        page_counter_charge(&memcg->memsw, nr_pages);
-        }
-
-        css_get(&memcg->css);
+        /* Transfer the charge and the css ref */
         commit_charge(new, memcg);
-
-        local_irq_save(flags);
-        mem_cgroup_charge_statistics(memcg, nr_pages);
-        memcg_check_events(memcg, folio_nid(new));
-        local_irq_restore(flags);
+        old->memcg_data = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
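
On the hugetlb side, the transfer would then sit right next to the
existing hugetlb controller handoff. Below is a minimal sketch, not part
of the patch above: hugetlb_memcg_migrate() is a hypothetical helper
name, and the real hook would presumably be the hugetlb migration path
that already calls hugetlb_cgroup_migrate() from move_hugetlb_state().

/*
 * Hypothetical sketch (would live in mm/hugetlb.c, with
 * <linux/memcontrol.h> and <linux/hugetlb_cgroup.h>): transfer both the
 * hugetlb cgroup info and the memcg binding when a hugetlb folio is
 * migrated, assuming the charge-neutral mem_cgroup_migrate() from the
 * diff above.
 */
static void hugetlb_memcg_migrate(struct folio *old_folio,
                                  struct folio *new_folio)
{
        /* hugetlb controller: move usage info over to the new folio */
        hugetlb_cgroup_migrate(old_folio, new_folio);

        /*
         * memcg: move the charge and the css reference without touching
         * the page counters; old_folio->memcg_data is cleared, so the
         * eventual free_huge_folio() -> mem_cgroup_uncharge() on the old
         * folio is a no-op and usage stays flat across the migration.
         */
        mem_cgroup_migrate(old_folio, new_folio);
}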