The patch titled Subject: mm: memcontrol: charge swapin pages on instantiation has been added to the -mm tree. Its filename is mm-memcontrol-charge-swapin-pages-on-instantiation.patch This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/mm-memcontrol-charge-swapin-pages-on-instantiation.patch and later at http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcontrol-charge-swapin-pages-on-instantiation.patch Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next and is updated there every 3-4 working days ------------------------------------------------------ From: Johannes Weiner <hannes@xxxxxxxxxxx> Subject: mm: memcontrol: charge swapin pages on instantiation Right now, users that are otherwise memory controlled can easily escape their containment and allocate significant amounts of memory that they're not being charged for. That's because swap readahead pages are not being charged until somebody actually faults them into their page table. This can be exploited with MADV_WILLNEED, which triggers arbitrary readahead allocations without charging the pages. There are additional problems with the delayed charging of swap pages: 1. To implement refault/workingset detection for anonymous pages, we need to have a target LRU available at swapin time, but the LRU is not determinable until the page has been charged. 2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be stable when the page is isolated from the LRU; otherwise, the locks change under us. But swapcache gets charged after it's already on the LRU, and even if we cannot isolate it ourselves (since charging is not exactly optional). The previous patch ensured we always maintain cgroup ownership records for swap pages. This patch moves the swapcache charging point from the fault handler to swapin time to fix all of the above problems. v2: simplify swapin error checking (Joonsoo) Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@xxxxxxxxxxx Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx> Reviewed-by: Alex Shi <alex.shi@xxxxxxxxxxxxxxxxx> Cc: Hugh Dickins <hughd@xxxxxxxxxx> Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> Cc: "Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxx> Cc: Roman Gushchin <guro@xxxxxx> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- mm/memory.c | 15 +++++-- mm/shmem.c | 14 ++++--- mm/swap_state.c | 89 ++++++++++++++++++++++++---------------------- mm/swapfile.c | 6 --- 4 files changed, 67 insertions(+), 57 deletions(-) --- a/mm/memory.c~mm-memcontrol-charge-swapin-pages-on-instantiation +++ a/mm/memory.c @@ -3127,9 +3127,20 @@ vm_fault_t do_swap_page(struct vm_fault page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); if (page) { + int err; + __SetPageLocked(page); __SetPageSwapBacked(page); set_page_private(page, entry.val); + + /* Tell memcg to use swap ownership records */ + SetPageSwapCache(page); + err = mem_cgroup_charge(page, vma->vm_mm, + GFP_KERNEL, false); + ClearPageSwapCache(page); + if (err) + goto out_page; + lru_cache_add_anon(page); swap_readpage(page, true); } @@ -3191,10 +3202,6 @@ vm_fault_t do_swap_page(struct vm_fault goto out_page; } - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) { - ret = VM_FAULT_OOM; - goto out_page; - } cgroup_throttle_swaprate(page, GFP_KERNEL); /* --- a/mm/shmem.c~mm-memcontrol-charge-swapin-pages-on-instantiation +++ a/mm/shmem.c @@ -623,13 +623,15 @@ static int shmem_add_to_page_cache(struc page->mapping = mapping; page->index = index; - error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page)); - if (error) { - if (!PageSwapCache(page) && PageTransHuge(page)) { - count_vm_event(THP_FILE_FALLBACK); - count_vm_event(THP_FILE_FALLBACK_CHARGE); + if (!PageSwapCache(page)) { + error = mem_cgroup_charge(page, charge_mm, gfp, false); + if (error) { + if (PageTransHuge(page)) { + count_vm_event(THP_FILE_FALLBACK); + count_vm_event(THP_FILE_FALLBACK_CHARGE); + } + goto error; } - goto error; } cgroup_throttle_swaprate(page, gfp); --- a/mm/swapfile.c~mm-memcontrol-charge-swapin-pages-on-instantiation +++ a/mm/swapfile.c @@ -1862,11 +1862,6 @@ static int unuse_pte(struct vm_area_stru if (unlikely(!page)) return -ENOMEM; - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) { - ret = -ENOMEM; - goto out_nolock; - } - pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) { ret = 0; @@ -1892,7 +1887,6 @@ static int unuse_pte(struct vm_area_stru activate_page(page); out: pte_unmap_unlock(pte, ptl); -out_nolock: if (page != swapcache) { unlock_page(page); put_page(page); --- a/mm/swap_state.c~mm-memcontrol-charge-swapin-pages-on-instantiation +++ a/mm/swap_state.c @@ -360,12 +360,13 @@ struct page *__read_swap_cache_async(swp struct vm_area_struct *vma, unsigned long addr, bool *new_page_allocated) { - struct page *found_page = NULL, *new_page = NULL; struct swap_info_struct *si; - int err; + struct page *page; + *new_page_allocated = false; - do { + for (;;) { + int err; /* * First check the swap cache. Since this is normally * called after lookup_swap_cache() failed, re-calling @@ -373,12 +374,12 @@ struct page *__read_swap_cache_async(swp */ si = get_swap_device(entry); if (!si) - break; - found_page = find_get_page(swap_address_space(entry), - swp_offset(entry)); + return NULL; + page = find_get_page(swap_address_space(entry), + swp_offset(entry)); put_swap_device(si); - if (found_page) - break; + if (page) + return page; /* * Just skip read ahead for unused swap slot. @@ -389,21 +390,15 @@ struct page *__read_swap_cache_async(swp * else swap_off will be aborted if we return NULL. */ if (!__swp_swapcount(entry) && swap_slot_cache_enabled) - break; - - /* - * Get a new page to read into from swap. - */ - if (!new_page) { - new_page = alloc_page_vma(gfp_mask, vma, addr); - if (!new_page) - break; /* Out of memory */ - } + return NULL; /* * Swap entry may have been freed since our caller observed it. */ err = swapcache_prepare(entry); + if (!err) + break; + if (err == -EEXIST) { /* * We might race against get_swap_page() and stumble @@ -412,31 +407,43 @@ struct page *__read_swap_cache_async(swp */ cond_resched(); continue; - } else if (err) /* swp entry is obsolete ? */ - break; - - /* May fail (-ENOMEM) if XArray node allocation failed. */ - __SetPageLocked(new_page); - __SetPageSwapBacked(new_page); - err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL); - if (likely(!err)) { - /* Initiate read into locked page */ - SetPageWorkingset(new_page); - lru_cache_add_anon(new_page); - *new_page_allocated = true; - return new_page; } - __ClearPageLocked(new_page); - /* - * add_to_swap_cache() doesn't return -EEXIST, so we can safely - * clear SWAP_HAS_CACHE flag. - */ - put_swap_page(new_page, entry); - } while (err != -ENOMEM); - if (new_page) - put_page(new_page); - return found_page; + return NULL; + } + + /* + * The swap entry is ours to swap in. Prepare a new page. + */ + + page = alloc_page_vma(gfp_mask, vma, addr); + if (!page) + goto fail_free; + + __SetPageLocked(page); + __SetPageSwapBacked(page); + + /* May fail (-ENOMEM) if XArray node allocation failed. */ + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) + goto fail_unlock; + + if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false)) + goto fail_delete; + + /* Initiate read into locked page */ + SetPageWorkingset(page); + lru_cache_add_anon(page); + *new_page_allocated = true; + return page; + +fail_delete: + delete_from_swap_cache(page); +fail_unlock: + unlock_page(page); + put_page(page); +fail_free: + swap_free(entry); + return NULL; } /* _ Patches currently in -mm which might be from hannes@xxxxxxxxxxx are mm-fix-numa-node-file-count-error-in-replace_page_cache.patch mm-memcontrol-fix-stat-corrupting-race-in-charge-moving.patch mm-memcontrol-drop-compound-parameter-from-memcg-charging-api.patch mm-memcontrol-move-out-cgroup-swaprate-throttling.patch mm-memcontrol-convert-page-cache-to-a-new-mem_cgroup_charge-api.patch mm-memcontrol-prepare-uncharging-for-removal-of-private-page-type-counters.patch mm-memcontrol-prepare-move_account-for-removal-of-private-page-type-counters.patch mm-memcontrol-prepare-cgroup-vmstat-infrastructure-for-native-anon-counters.patch mm-memcontrol-switch-to-native-nr_file_pages-and-nr_shmem-counters.patch mm-memcontrol-switch-to-native-nr_anon_mapped-counter.patch mm-memcontrol-switch-to-native-nr_anon_thps-counter.patch mm-memcontrol-convert-anon-and-file-thp-to-new-mem_cgroup_charge-api.patch mm-memcontrol-drop-unused-try-commit-cancel-charge-api.patch mm-memcontrol-prepare-swap-controller-setup-for-integration.patch mm-memcontrol-make-swap-tracking-an-integral-part-of-memory-control.patch mm-memcontrol-charge-swapin-pages-on-instantiation.patch mm-memcontrol-delete-unused-lrucare-handling.patch mm-memcontrol-update-page-mem_cgroup-stability-rules.patch