Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()

Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> · Tue, 27 Aug 2024 12:33:53 +0800

Hi David,

On 2024/8/26 23:21, David Hildenbrand wrote:
On 22.08.24 09:13, Qi Zheng wrote:
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the 
PTE
    page. In this case, the RCU lock in pte_offset_map_nolock() will 
ensure
    that the PTE page will not be freed, and there is no need to worry
    about whether the pmd entry is modified.

There is also the usage where we don't grab the PTL at all, and only do 
a racy (read-only) lookup.

IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
in this case.



2) After acquiring PTL, the pte or pmd entries may be modified. At this
    time, we need to ensure that the pmd entry has not been modified
    concurrently.

To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. This can help the caller recheck *pmd once the 
PTL
is taken. In some cases, that is, either the mmap_lock for write, or
pte_same() check on contents, is also enough to ensure that the pmd entry
is stable. But in order to prevent the interface from being abused, we
choose to pass in a dummy local variable instead of NULL.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.

Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
---
  Documentation/mm/split_page_table_lock.rst |  7 ++++
  include/linux/mm.h                         |  5 +++
  mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
  3 files changed, 55 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst 
b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other 
accessor functions:
   - pte_offset_map_nolock()
      maps PTE, returns pointer to PTE with pointer to its PTE table
      lock (not taken), or returns NULL if no PTE table;

What will happen to pte_offset_map_nolock() after this series? Does it 
still exist or will it become an internal helper?

I choose to remove it completely in [PATCH v2 13/14].


+ - pte_offset_map_ro_nolock()
+    maps PTE, returns pointer to PTE with pointer to its PTE table
+    lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+    maps PTE, returns pointer to PTE with pointer to its PTE table
+    lock (not taken) and the value of its pmd entry, or returns NULL
+    if no PTE table;

[...]

+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+                unsigned long addr, pmd_t *pmdvalp,
+                spinlock_t **ptlp)
+{
+    pmd_t pmdval;
+    pte_t *pte;
+
+    BUG_ON(!pmdvalp);

As raised, no BUG_ON please. VM_WARN_ON_ONCE() is helpful during early 
testing and should catch these kind of things.

OK, this patch was sent before you pointed out this, will use
VM_WARN_ON_ONCE() instead of BUG_ON() in v3.


If someone thinks not requiring a non-NULL pointer is better, please 
speak up, I'm not married to that idea :)

+    pte = __pte_offset_map(pmd, addr, &pmdval);
+    if (likely(pte))
+        *ptlp = pte_lockptr(mm, &pmdval);
+    *pmdvalp = pmdval;
+    return pte;
+}
+
  /*
   * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal 
implementation
   * __pte_offset_map_lock() below, is usually called with the pmd 
pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct 
*mm, pmd_t *pmd,
   * recheck *pmd once the lock is taken; in practice, no callsite 
needs that -
   * either the mmap_lock for write, or pte_same() check on contents, 
is enough.
   *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer 
to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case 
without
+ * locking it.  This helps the caller to avoid a later 
pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: 
pte_offset_map_ro_nolock()
+ * provides the correct spinlock pointer for the page table that it 
returns.
+ * For readonly case, the caller does not need to recheck *pmd after 
the lock is
+ * taken, because the RCU lock will ensure that the PTE page will not 
be freed. > + *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is 
like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the
+ * pdmval. For cases where pte or pmd entries may be modified, that 
is, maywrite
+ * case, this can help the caller recheck *pmd once the lock is 
taken. In some
+ * cases, that is, either the mmap_lock for write, or pte_same() 
check on
+ * contents, is also enough to ensure that the pmd entry is stable.
+ *
   * Note that free_pgtables(), used after unmapping detached vmas, or 
when
   * exiting the whole mm, does not take page table lock before 
freeing a page
   * table, and may not use RCU at all: "outsiders" like khugepaged 
should avoid

In general to me a step into the right direction. Likely the 
documentation could be further clarified in some aspects:

Like that the use of pte_offset_map_ro_nolock() does not allow to easily 
identify if the page table was replaced in the meantime. Even after 
grabbing the PTL, we might be looking either at a page table that is 
still mapped or one that was unmapped and is about to get freed. But for 
R/O access this is usually sufficient AFAIUK.

Or that "RO" / "RW" expresses the intended semantics, not that the 
*kmap* will be RO/RW protected.

How about the following:

pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it.  This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was 
unmapped and is about to get freed. But for R/O access this is usually
sufficient AFAIUK.

pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pdmval. For R/W access, the callers can not accept that the page table
it sees has been unmapped and is about to get freed. The pmdval can help
callers to recheck pmd_same() to identify this case once the spinlock is
taken. For some cases where exclusivity is already guaranteed, such as
holding the write lock of mmap_lock, or in cases where checking is
sufficient, such as a !pte_none() pte will be rechecked after the
spinlock is taken, there is no need to recheck pdmval.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be RO/RW protected.

Thanks,
Qi