Re: [Question]: major faults are still triggered after mlockall when numa balancing

Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> · Fri, 10 Nov 2023 11:39:24 +0800

On 2023/11/10 9:57, Yin, Fengwei wrote:

On 11/10/2023 6:54 AM, Yang Shi wrote:
On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote:

Hi everyone,

There is a performance issue that has been bothering us recently.
This problem can be reproduced on the latest mainline version (Linux 6.6).

We use mlockall(MCL_CURRENT | MCL_FUTURE) in a user-mode process
to avoid performance problems caused by major faults.
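
For readers who want to reproduce the setup, a minimal sketch of the
mlockall() call described above might look like the following. The
helper name lock_all_memory() is hypothetical and is not taken from the
reporter's program; the call can fail (for example when RLIMIT_MEMLOCK
is too low), so the sketch reports the error instead of assuming
success.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: lock all current and future mappings of the
 * calling process. Returns 0 on success, or the errno value reported
 * by mlockall() on failure.
 */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        int err = errno;

        fprintf(stderr, "mlockall failed: %s\n", strerror(err));
        return err;
    }
    return 0;
}
```

A program would normally call this once at startup, before entering its
latency-sensitive path.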

There is a stage in the NUMA fault path where do_numa_page() sets the
PTE to 0: ptep_modify_prot_start() clears vmf->pte, and it stays clear
until ptep_modify_prot_commit() assigns a new value to vmf->pte.

For the data segment of a user-mode program, the global variable area
is a private file mapping. After the page cache is populated, a write
triggers COW and a private anonymous page is created. mlockall() can
lock the COW pages (anonymous pages), but the original file pages cannot
be locked and may be reclaimed. If a global variable (private anon page)
is accessed while vmf->pte is zero because a concurrent NUMA fault has
cleared it, a file page fault is triggered.

At this time, the original private file page may have been reclaimed.
If the page cache is not available at this time, a major fault will be
triggered and the file will be read, causing additional overhead.
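
One way to observe these extra major faults from user space is to read
the process's major fault counter before and after the critical code.
This is only a sketch: the helper name major_fault_count() is
hypothetical, and it relies on the ru_majflt field that getrusage()
fills in on Linux.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: number of major page faults taken by the
 * calling process so far, or -1 if getrusage() fails.
 */
long major_fault_count(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}
```

Sampling the counter around the code that touches the global variables
shows whether any access in between had to read from the file.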

Our problem scenario is as follows:

task 1                      task 2
------                      ------
/* scan global variables */
do_numa_page()
    spin_lock(vmf->ptl)
    ptep_modify_prot_start()
    /* set vmf->pte as null */
                              /* Access global variables */
                              handle_pte_fault()
                                /* no pte lock */
                                do_pte_missing()
                                  do_fault()
                                    do_read_fault()
    ptep_modify_prot_commit()
    /* ptep update done */
    pte_unmap_unlock(vmf->pte, vmf->ptl)
                                      do_fault_around()
                                      __do_fault()
                                        filemap_fault()
                                          /* page cache is not available
                                          and a major fault is triggered */
                                          do_sync_mmap_readahead()
                                          /* page_not_uptodate and goto
                                          out_retry. */
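
The interleaving above can be restated as a toy user-space model. This
is not kernel code: the toy_* names and enums are hypothetical, and the
dispatch is a heavily simplified stand-in for handle_pte_fault(). It
assumes, as described above, that the PTE is protnone before
ptep_modify_prot_start(), empty between start and commit, and present
after ptep_modify_prot_commit().

```c
/* Toy model only: the three PTE states that matter in this report. */
enum toy_pte { TOY_PTE_NONE, TOY_PTE_PROTNONE, TOY_PTE_PRESENT };

/* Which path a concurrent access from task 2 would end up on. */
enum toy_path { PATH_PTE_MISSING, PATH_NUMA, PATH_NO_FAULT };

/* PTE value visible to task 2 at each step of task 1's update. */
enum toy_pte toy_pte_at_step(int step)
{
    switch (step) {
    case 0:
        return TOY_PTE_PROTNONE; /* before ptep_modify_prot_start() */
    case 1:
        return TOY_PTE_NONE;     /* between start and commit */
    default:
        return TOY_PTE_PRESENT;  /* after ptep_modify_prot_commit() */
    }
}

/* Simplified stand-in for the dispatch done in handle_pte_fault(). */
enum toy_path toy_dispatch(enum toy_pte pte)
{
    switch (pte) {
    case TOY_PTE_NONE:
        return PATH_PTE_MISSING; /* do_pte_missing() -> do_fault() */
    case TOY_PTE_PROTNONE:
        return PATH_NUMA;        /* do_numa_page() */
    default:
        return PATH_NO_FAULT;    /* the hardware access succeeds */
    }
}
```

Only step 1 sends task 2 down the do_pte_missing() path, which for a
file-backed private mapping leads to do_read_fault() and, if the page
cache page has been reclaimed, to a major fault.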

Is there any way to avoid such a major fault?

IMHO I don't think it is a bug. The man page quoted by Willy says "All
mapped pages are guaranteed to be resident in RAM when the call
returns successfully", but the later COW already made the file page
unmapped, right? The PTE pointed to the COW'ed anon page.
Hypothetically, if we kept the file page mlocked and unmapped,
munlock() would not have munlocked the file page at all, and it would
stay mlocked in memory forever.
But in this case, even the COW page is mlocked. There is a small window
in which the PTE is set to null in do_numa_page(). A data segment access
(which targets the COW page and has nothing to do with the original page
cache) that happens in this small window triggers filemap_fault() to
fault in the original page cache page.

I had thought about double-checking whether vmf->pte is NULL in
do_read_fault(), but that is not reliable enough.

Matthew's idea to use protnone to block both hardware access and
do_pte_missing() looks more promising to me.

Actually, we could revert the following patch to avoid this issue,
but that patch is a workaround for ppc...

commit cee216a696b2004017a5ecb583366093d90b1568
Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Date:   Fri Feb 24 14:59:13 2017 -0800

    mm/autonuma: don't use set_pte_at when updating protnone ptes

    Architectures like ppc64, use privilege access bit to mark pte non
    accessible.  This implies that kernel can do a copy_to_user to an
    address marked for numa fault.  This also implies that there can be a
    parallel hardware update for the pte.  set_pte_at cannot be used in such
    scenarios.  Hence switch the pte update to use ptep_get_and_clear and
    set_pte_at combination.

Regards
Yin, Fengwei

--
Best Regards,
Peng