[RFC PATCH 0/3] asynchronously scan and free empty user PTE pages

Hi all,

This series aims to asynchronously scan and free empty user PTE pages.

1. Background
=============

We often find huge user PTE memory usage on our servers, such as the following:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g
        
The root cause is that these processes use some high-performance memory
allocators (such as jemalloc, tcmalloc, etc). These memory allocators use
madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but neither
MADV_DONTNEED nor MADV_FREE will release page table memory, which may cause
huge page table memory usage.
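
As a rough consistency check on the numbers above, here is a minimal sketch
of the arithmetic. It assumes 4 KiB base pages and 512 entries per PTE page
(the x86-64 layout); neither is stated in this cover letter, and the helper
name is only illustrative.

```c
#include <stdint.h>

/* Assumed geometry (x86-64, 4 KiB base pages); not taken from the series. */
#define ASSUMED_PAGE_SIZE    4096ULL
#define ASSUMED_PTRS_PER_PTE 512ULL

/*
 * Bytes of PTE pages needed if every PTE page covering a virtual range of
 * virt_bytes is allocated. Each PTE page maps PAGE_SIZE * PTRS_PER_PTE bytes.
 */
static uint64_t pte_page_bytes(uint64_t virt_bytes)
{
	uint64_t span = ASSUMED_PAGE_SIZE * ASSUMED_PTRS_PER_PTE; /* 2 MiB */
	uint64_t nr_pte_pages = (virt_bytes + span - 1) / span;   /* round up */

	return nr_pte_pages * ASSUMED_PAGE_SIZE;
}
```

Under these assumptions, pte_page_bytes(55ULL << 40) equals 110ULL << 30. If
the "t" and "g" suffixes above are read as TiB and GiB, this suggests that PTE
pages remain allocated for almost the whole virtual address range.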

This issue was discussed at LSF/MM 2022 (led by David Hildenbrand):

        topic link: https://lore.kernel.org/linux-mm/7b908208-02f8-6fde-4dfc-13d5e00310a6@xxxxxxxxxx/
        youtube link: https://www.youtube.com/watch?v=naO_BRhcU68
        
In the past, I tried to introduce a refcount for PTE pages to solve this
problem, but those attempts [1][2][3] introduced too much complexity.

[1]. https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@xxxxxxxxxxxxx/
[2]. https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@xxxxxxxxxxxxx/
[3]. https://lore.kernel.org/lkml/20220825101037.96517-1-zhengqi.arch@xxxxxxxxxxxxx/

2. Infrastructure
=================

Later, in order to free retracted page tables, Hugh Dickins added a lot of
PTE-related infrastructure [4][5][6]:

    - allow pte_offset_map_lock() etc to fail
    - allow PTE pages to be removed without mmap or rmap locks
      (see collapse_pte_mapped_thp() and retract_page_tables())
    - allow PTE pages to be freed by RCU (via pte_free_defer())
    - etc
    
These are all beneficial to freeing empty PTE pages.

[4]. https://lore.kernel.org/all/a4963be9-7aa6-350-66d0-2ba843e1af44@xxxxxxxxxx/
[5]. https://lore.kernel.org/all/c1c9a74a-bc5b-15ea-e5d2-8ec34bc921d@xxxxxxxxxx/
[6]. https://lore.kernel.org/all/7cd843a9-aa80-14f-5eb2-33427363c20@xxxxxxxxxx/

3. Implementation
=================

For empty user PTE pages, we don't actually need to free them immediately, nor
do we need to free all of them.

Therefore, in this patchset, we register a task_work for the user tasks to
asynchronously scan and free empty PTE pages when they return to user space.
(The scanning time interval and address space size can be adjusted.)
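
To illustrate what an adjustable interval and scan size could mean in
practice, here is a small userspace sketch of rate-limited, windowed scanning.
It is not the code in this series: the struct, the function and the parameter
names (delay, size, as_end) are hypothetical, and time is in arbitrary ticks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-mm scan state; names are illustrative only. */
struct toy_scan_state {
	uint64_t next_scan;	/* earliest tick at which the next pass may run */
	uint64_t cursor;	/* address at which the next pass resumes */
};

struct toy_scan_window {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
};

/*
 * Decide whether a scan pass should run at time 'now'. If so, fill 'win'
 * with the next range of at most 'size' bytes, advance the cursor (wrapping
 * at 'as_end'), and push the next pass out by 'delay' ticks.
 */
static bool toy_scan_step(struct toy_scan_state *st, uint64_t now,
			  uint64_t delay, uint64_t size, uint64_t as_end,
			  struct toy_scan_window *win)
{
	if (now < st->next_scan)
		return false;		/* too soon since the last pass */

	if (st->cursor >= as_end)
		st->cursor = 0;		/* wrap to the start of the address space */

	win->start = st->cursor;
	win->end = (as_end - st->cursor < size) ? as_end : st->cursor + size;

	st->cursor = win->end;
	st->next_scan = now + delay;
	return true;
}
```

Each pass only covers a bounded window, so the cost paid on a single return to
user space stays small, and the whole address space is covered over time.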

When scanning, we can filter out some unsuitable vmas:

    - VM_HUGETLB vma
    - VM_UFFD_WP vma
    - etc
    
We can also skip PTE pages that span multiple vmas.
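
The filtering can be sketched in userspace roughly as follows. This is a toy
model, not the code in this series: struct toy_vma, the TOY_* flag values and
the 2 MiB PTE page span are assumptions made for illustration (the real flag
bits are defined in include/linux/mm.h).

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; these are not the kernel's flag bits. */
#define TOY_VM_HUGETLB	(1UL << 0)
#define TOY_VM_UFFD_WP	(1UL << 1)

/* Assumed: one PTE page maps 2 MiB (4 KiB pages, 512 entries). */
#define TOY_PMD_SIZE	(1ULL << 21)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

struct toy_vma {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
	unsigned long flags;
};

/* Skip vmas whose type makes them unsuitable for scanning. */
static bool toy_vma_scannable(const struct toy_vma *vma)
{
	return !(vma->flags & (TOY_VM_HUGETLB | TOY_VM_UFFD_WP));
}

/*
 * True if the whole range mapped by the PTE page covering 'addr' lies inside
 * 'vma', i.e. the PTE page is not shared with a neighbouring vma.
 */
static bool toy_pte_page_within_vma(const struct toy_vma *vma, uint64_t addr)
{
	uint64_t start = addr & TOY_PMD_MASK;
	uint64_t end = start + TOY_PMD_SIZE;

	return start >= vma->start && end <= vma->end;
}

static bool toy_should_scan(const struct toy_vma *vma, uint64_t addr)
{
	return toy_vma_scannable(vma) && toy_pte_page_within_vma(vma, addr);
}
```

For example, a vma starting at 0x201000 does not fully cover the PTE page that
maps 0x200000-0x3fffff, so that PTE page would be skipped.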

For locking:

    - use the mmap read lock to traverse the vma tree and pgtable
    - use pmd lock for clearing pmd entry
    - use pte lock for checking whether a PTE page is empty, and release it
      only after clearing the pmd entry. Users of pte_offset_map_lock() etc
      can then detect the changed pmd once they hold this pte lock. Thanks to
      this, we don't need to hold the rmap-related locks.
    - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
      using rcu lock, so use pte_free_defer() to free PTE pages.
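
The lock nesting described above can be modelled in userspace as follows.
This is only a sketch: pthread mutexes stand in for the pte and pmd
spinlocks, all toy_* names are hypothetical, and the RCU-deferred free is left
to the caller (pte_free_defer() in the kernel).

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_PTE 512

struct toy_pte_page {
	pthread_mutex_t ptl;			/* stands in for the pte lock */
	uint64_t entries[TOY_PTRS_PER_PTE];	/* 0 models pte_none() */
};

struct toy_pmd {
	pthread_mutex_t pml;			/* stands in for the pmd lock */
	struct toy_pte_page *pte_page;		/* NULL models a cleared pmd */
};

/*
 * Detach the PTE page from 'pmd' if all of its entries are empty.
 * Returns the detached page (to be freed later by the caller) or NULL.
 */
static struct toy_pte_page *toy_try_detach_empty(struct toy_pmd *pmd)
{
	/* The kernel reads the pmd locklessly under RCU; plain read here. */
	struct toy_pte_page *pt = pmd->pte_page;
	bool empty = true;

	if (!pt)
		return NULL;

	pthread_mutex_lock(&pt->ptl);

	/* Recheck the pmd after taking the pte lock: it may have changed. */
	if (pmd->pte_page != pt) {
		pthread_mutex_unlock(&pt->ptl);
		return NULL;
	}

	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++) {
		if (pt->entries[i]) {
			empty = false;
			break;
		}
	}

	if (empty) {
		/* pmd lock nests inside the pte lock, only for the clear. */
		pthread_mutex_lock(&pmd->pml);
		pmd->pte_page = NULL;
		pthread_mutex_unlock(&pmd->pml);
	}

	/* The pte lock is dropped only after the pmd entry is cleared. */
	pthread_mutex_unlock(&pt->ptl);

	return empty ? pt : NULL;
}
```

Because the pte lock is held across the clear, anyone who later takes the same
pte lock and rechecks the pmd sees that the PTE page has been detached.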
      
For the paths that also free PTE pages for THP, we need to recheck whether the
pmd entry is still valid after taking the pmd lock or pte lock.

4. TODO
=======

Some applications may be concerned about the overhead of scanning and rebuilding
page tables, so the following features are considered for implementation in the
future:

    - add per-process switch (via prctl)
    - add a madvise option (like THP)
    - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via procfs file)
    
Perhaps we can add the refcount to PTE pages in the future as well, which would
help improve the scanning speed.

This series is based on next-20240612.

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (3):
  mm: pgtable: move pte_free_defer() out of CONFIG_TRANSPARENT_HUGEPAGE
  mm: pgtable: make pte_offset_map_nolock() return pmdval
  mm: free empty user PTE pages

 Documentation/mm/split_page_table_lock.rst |   3 +-
 arch/arm/mm/fault-armv.c                   |   2 +-
 arch/powerpc/mm/pgtable-frag.c             |   2 -
 arch/powerpc/mm/pgtable.c                  |   2 +-
 arch/s390/mm/pgalloc.c                     |   2 -
 arch/sparc/mm/init_64.c                    |   2 +-
 include/linux/mm.h                         |   4 +-
 include/linux/mm_types.h                   |   4 +
 include/linux/pgtable.h                    |  14 ++
 include/linux/sched.h                      |   1 +
 kernel/sched/core.c                        |   1 +
 kernel/sched/fair.c                        |   2 +
 mm/Makefile                                |   2 +-
 mm/filemap.c                               |   2 +-
 mm/freept.c                                | 180 +++++++++++++++++++++
 mm/khugepaged.c                            |  20 ++-
 mm/memory.c                                |   4 +-
 mm/mremap.c                                |   2 +-
 mm/page_vma_mapped.c                       |   2 +-
 mm/pgtable-generic.c                       |  23 +--
 mm/userfaultfd.c                           |   4 +-
 mm/vmscan.c                                |   2 +-
 22 files changed, 249 insertions(+), 31 deletions(-)
 create mode 100644 mm/freept.c

-- 
2.20.1




