On 2024/8/28 18:48, David Hildenbrand wrote:
On 27.08.24 06:33, Qi Zheng wrote:
Hi David,
On 2024/8/26 23:21, David Hildenbrand wrote:
On 22.08.24 09:13, Qi Zheng wrote:
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the
PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will
ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
There is also the usage where we don't grab the PTL at all, and only do
a racy (read-only) lookup.
IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
in this case.
Yes, but the filemap.c thingy conditionally wants to lock later. But I
agree that pte_offset_map() is better when not even wanting to lock.
[...]
accessor functions:
- pte_offset_map_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
lock (not taken), or returns NULL if no PTE table;
What will happen to pte_offset_map_nolock() after this series? Does it
still exist or will it become an internal helper?
I choose to remove it completely in [PATCH v2 13/14].
Ah, great.
[...]
If someone thinks not requiring a non-NULL pointer is better, please
speak up, I'm not married to that idea :)
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ *pmdvalp = pmdval;
+ return pte;
+}
+
/*
* pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal
implementation
* __pte_offset_map_lock() below, is usually called with the pmd
pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct
*mm, pmd_t *pmd,
* recheck *pmd once the lock is taken; in practice, no callsite
needs that -
* either the mmap_lock for write, or pte_same() check on contents,
is enough.
*
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer
to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without
+ * locking it. This helps the caller to avoid a later
pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock()
+ * provides the correct spinlock pointer for the page table that it
returns.
+ * For readonly case, the caller does not need to recheck *pmd after
the lock is
+ * taken, because the RCU lock will ensure that the PTE page will not
be freed. > + *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is
like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs
the
+ * pdmval. For cases where pte or pmd entries may be modified, that
is, maywrite
+ * case, this can help the caller recheck *pmd once the lock is
taken. In some
+ * cases, that is, either the mmap_lock for write, or pte_same()
check on
+ * contents, is also enough to ensure that the pmd entry is stable.
+ *
* Note that free_pgtables(), used after unmapping detached vmas, or
when
* exiting the whole mm, does not take page table lock before
freeing a page
* table, and may not use RCU at all: "outsiders" like khugepaged
should avoid
In general to me a step into the right direction. Likely the
documentation could be further clarified in some aspects:
Like that the use of pte_offset_map_ro_nolock() does not allow to easily
identify if the page table was replaced in the meantime. Even after
grabbing the PTL, we might be looking either at a page table that is
still mapped or one that was unmapped and is about to get freed. But for
R/O access this is usually sufficient AFAIUK.
Or that "RO" / "RW" expresses the intended semantics, not that the
*kmap* will be RO/RW protected.
How about the following:
pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it. This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was
unmapped and is about to get freed. But for R/O access this is usually
sufficient AFAIUK.
Drop the "AFAIUK" :)
"For R/O access this is sufficient."
OK.
pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pdmval. For R/W access, the callers can not accept that the page table
it sees has been unmapped and is about to get freed. The pmdval can help
callers to recheck pmd_same() to identify this case once the spinlock is
taken. For some cases where exclusivity is already guaranteed, such as
holding the write lock of mmap_lock, or in cases where checking is
sufficient, such as a !pte_none() pte will be rechecked after the
spinlock is taken, there is no need to recheck pdmval.
Right, using pte_same() one can achieve a similar result, assuming that
the freed page table gets all ptes set to pte_none().
page_table_check_pte_clear_range() before pte_free_defer() in
retract_page_tables/collapse_pte_mapped_thp() sanity checks that I think.
In collapse_huge_page() that is not the case. But here, we also
currently grab all heavily locks, to prevent any concurrent page table
walker.
Yes.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be RO/RW protected.
Good. Please also incorporate the feedback from Muchun.
OK, I will change it in v3 to the following:
pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it. This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was
unmapped and is about to get freed. But for R/O access this is
sufficient. So it is only applicable for read-only cases where any
modification operations to the page table are not allowed even if
the corresponding spinlock is held afterwards.
pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pdmval. It is applicable for may-write cases where any modification
operations to the page table may happen after the corresponding spinlock
is held afterwards. But the users should make sure the page table is
stable like holding mmap_lock for write or checking pte_same() or
checking pmd_same() by using the output pmdval before performing the
write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.