Re: [PATCH] docs/mm: add more warnings around page table access

Jann Horn <jannh@xxxxxxxxxx> · Fri, 15 Nov 2024 20:04:52 +0100

On Fri, Nov 15, 2024 at 7:02 PM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
> On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote:
> > Make it clearer that holding the mmap lock in read mode is not enough
> > to traverse page tables, and that just having a stable VMA is not enough
> > to read PTEs.
> >
> > Suggested-by: Matteo Rizzo <matteorizzo@xxxxxxxxxx>
> > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
>
> Have some queries before we move forward so would like a little more
> clarification/perhaps putting some extra meat on the bones first.
>
> Broadly very glad you have done this however so it's just sorting details
> first! :>)
>
> > ---
> > @akpm: Please don't put this in your tree before Lorenzo has replied.
> >
> > @Lorenzo:
> > This is intended to go on top of your documentation patch.
> > If you think this is a sensible change, do you prefer to squash it into
> > your patch or do you prefer having akpm take this as a separate patch?
> > IDK what works better...
>
> I think a new patch is better, as I'd like the original to settle down now
> and the whole point of this doc is that it's a living thing that many
> people can contribute to, update, etc.
>
> For instance, Suren is updating as part of one of his series to correct
> things that he changes in that series, which is really nice.
>
> > ---
> >  Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++--
> >  1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> > index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644
> > --- a/Documentation/mm/process_addrs.rst
> > +++ b/Documentation/mm/process_addrs.rst
> > @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to
> >  keep the VMA stable. We explore why this is in the page table locking details
> >  section below.
> >
> > +.. warning:: Taking the mmap lock in read mode **is not sufficient** for
> > +             traversing page tables; you must also ensure that a VMA exists that
> > +             covers the range being accessed.
>
> Hm, but we say later we don't need _any_ locks for traversal, and here we
> say we need mmap read lock. Do you mean installing page table entries?
>
> Or do you mean to say, that if you don't span a VMA, you must acquire a
> write lock at least to preclude this?

Yeah, this is what I meant with the "you must also ensure that a VMA
exists" part. (Context is that I was looking at some non-upstream code
that was trying to do exactly this - take a userspace-supplied address
and walk down the page tables at that address. Upstream also once had
a UAF in pagemap_walk() because walk_page_range() was used while
holding only the mmap read lock, see
<https://project-zero.issues.chromium.org/42451485>.)

> This seems quite unclear.
>
> I kind of didn't want to touch on the horrors of fiddling about without a
> VMA, so I'd rather this very clearly say something like 'it is unusual to
> manipulate page tables wihch are not spanned by a VMA, and there are
> special requirements for this operation' etc. et.c otherwise this just adds
> more noise and confusion I think.

I guess maybe we could replace this entire warning block with
something like this?

".. warning:: Page tables are normally only traversed in regions
covered by VMAs. If you want to traverse page tables in areas that
might not be covered by VMAs, heavier locking is required. See
:c:func:`!walk_page_range_novma` for details."

And walk_page_range_novma() already has a comment block talking about
why an mmap read lock isn't enough to traverse user virtual address
space outside VMAs.

> > +             This ensures you can't race with concurrent page table removal
> > +             which happens with the mmap lock in read mode, in regions whose
> > +             VMAs are no longer present in the VMA tree.
> > +
> > +             (Alternatively, the mmap lock can be taken in write mode, but that
> > +             is heavy-handed and almost never the right choice.)
>
> You kind of need to expand on why that is I think!
>
> > +
> >  **Freeing** page tables is an entirely internal memory management operation and
> >  has special requirements (see the page freeing section below for more details).
> >
> > @@ -450,6 +460,9 @@ the time of writing of this document.
> >  Locking Implementation Details
> >  ------------------------------
> >
> > +.. warning:: Locking rules for PTE-level page tables are very different from
> > +             locking rules for page tables at other levels.
> > +
> >  Page table locking details
> >  --------------------------
> >
> > @@ -470,8 +483,12 @@ additional locks dedicated to page tables:
> >  These locks represent the minimum required to interact with each page table
> >  level, but there are further requirements.
> >
> > -Importantly, note that on a **traversal** of page tables, no such locks are
> > -taken. Whether care is taken on reading the page table entries depends on the
> > +Importantly, note that on a **traversal** of page tables, sometimes no such
> > +locks are taken. However, at the PTE level, at least concurrent page table
> > +deletion must be prevented (using RCU) and the page table must be mapped into
> > +high memory, see below.
>
> Ugh I really do hate that we have to think about high memory. I'd like to
> sort of deny it exists. But I suppose that's not an option.

(FWIW from a quick look I think only arm and x86 actually have this behavior.)

I mean... we could try to just make userspace page tables live in
non-high-memory based on the assumption that nobody nowadays is going
to run a 32-bit kernel on a device with significantly more than 4G or
8G or so of RAM, and probably most memory is used for stuff like
anon/file pages, not for page tables... but I don't think it actually
matters much for code complexity now that we need RCU for page table
access anyway.

> As for the RCU thing, I guess this is why pte_offset_map_lock() is taking
> it? Maybe worth mentioning something there or updating that 'interestingly'
> block... :>)

Yeah, and not just pte_offset_map_lock() but also pte_offset_map() and
such which don't lock the page table.

> Or am I mistaken? I wasn't aware of this requirement, is this sort of
> implied by the gup_fast() IRQ disabling stuff?

This is to protect against khugepaged/MADV_COLLAPSE deleting page
tables while holding only an rmap lock (in read mode), the pmd lock,
and the pte lock. So a concurrent call to MADV_COLLAPSE (potentially
in another process) can (via retract_page_tables()) remove the page
table while we're traversing it, but the page table will only actually
be freed via RCU (thanks to pte_free_defer), and so this doesn't cause
use-after-free. pte_offset_map_lock() ensures that on success, it has
locked the current page table (holding the pte lock prevents the page
table being removed in the future, and rechecking the pmd entry
ensures it hasn't already been removed), so we never end up installing
PTEs into page tables that have already been removed. pte_offset_map()
does not ensure that you always see the latest page table; it could be
that by the time you're looking at a PTE inside it, the page table
you're looking at has been removed, and a new page table has been
installed in its place and populated with PTEs. But removed page
tables are always empty, so the only possible consequence of such a
race is that you wrongly think that there is no PTE at a given
address.

The fourth paragraph of the comment block above
__pte_offset_map_lock() sorta explains this.

> Please expand :)