On Fri, Nov 15, 2024 at 7:02 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote: > > Make it clearer that holding the mmap lock in read mode is not enough > > to traverse page tables, and that just having a stable VMA is not enough > > to read PTEs. > > > > Suggested-by: Matteo Rizzo <matteorizzo@xxxxxxxxxx> > > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx> > > Have some queries before we move forward so would like a little more > clarification/perhaps putting some extra meat on the bones first. > > Broadly very glad you have done this however so it's just sorting details > first! :>) > > > --- > > @akpm: Please don't put this in your tree before Lorenzo has replied. > > > > @Lorenzo: > > This is intended to go on top of your documentation patch. > > If you think this is a sensible change, do you prefer to squash it into > > your patch or do you prefer having akpm take this as a separate patch? > > IDK what works better... > > I think a new patch is better, as I'd like the original to settle down now > and the whole point of this doc is that it's a living thing that many > people can contribute to, update, etc. > > For instance, Suren is updating as part of one of his series to correct > things that he changes in that series, which is really nice. > > > --- > > Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++-- > > 1 file changed, 19 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst > > index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644 > > --- a/Documentation/mm/process_addrs.rst > > +++ b/Documentation/mm/process_addrs.rst > > @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to > > keep the VMA stable. We explore why this is in the page table locking details > > section below. > > > > +.. warning:: Taking the mmap lock in read mode **is not sufficient** for > > + traversing page tables; you must also ensure that a VMA exists that > > + covers the range being accessed. > > Hm, but we say later we don't need _any_ locks for traversal, and here we > say we need mmap read lock. Do you mean installing page table entries? > > Or do you mean to say, that if you don't span a VMA, you must acquire a > write lock at least to preclude this? Yeah, this is what I meant with the "you must also ensure that a VMA exists" part. (Context is that I was looking at some non-upstream code that was trying to do exactly this - take a userspace-supplied address and walk down the page tables at that address. Upstream also once had a UAF in pagemap_walk() because walk_page_range() was used while holding only the mmap read lock, see <https://project-zero.issues.chromium.org/42451485>.) > This seems quite unclear. > > I kind of didn't want to touch on the horrors of fiddling about without a > VMA, so I'd rather this very clearly say something like 'it is unusual to > manipulate page tables wihch are not spanned by a VMA, and there are > special requirements for this operation' etc. et.c otherwise this just adds > more noise and confusion I think. I guess maybe we could replace this entire warning block with something like this? ".. warning:: Page tables are normally only traversed in regions covered by VMAs. If you want to traverse page tables in areas that might not be covered by VMAs, heavier locking is required. See :c:func:`!walk_page_range_novma` for details." And walk_page_range_novma() already has a comment block talking about why an mmap read lock isn't enough to traverse user virtual address space outside VMAs. > > + This ensures you can't race with concurrent page table removal > > + which happens with the mmap lock in read mode, in regions whose > > + VMAs are no longer present in the VMA tree. > > + > > + (Alternatively, the mmap lock can be taken in write mode, but that > > + is heavy-handed and almost never the right choice.) > > You kind of need to expand on why that is I think! > > > + > > **Freeing** page tables is an entirely internal memory management operation and > > has special requirements (see the page freeing section below for more details). > > > > @@ -450,6 +460,9 @@ the time of writing of this document. > > Locking Implementation Details > > ------------------------------ > > > > +.. warning:: Locking rules for PTE-level page tables are very different from > > + locking rules for page tables at other levels. > > + > > Page table locking details > > -------------------------- > > > > @@ -470,8 +483,12 @@ additional locks dedicated to page tables: > > These locks represent the minimum required to interact with each page table > > level, but there are further requirements. > > > > -Importantly, note that on a **traversal** of page tables, no such locks are > > -taken. Whether care is taken on reading the page table entries depends on the > > +Importantly, note that on a **traversal** of page tables, sometimes no such > > +locks are taken. However, at the PTE level, at least concurrent page table > > +deletion must be prevented (using RCU) and the page table must be mapped into > > +high memory, see below. > > Ugh I really do hate that we have to think about high memory. I'd like to > sort of deny it exists. But I suppose that's not an option. (FWIW from a quick look I think only arm and x86 actually have this behavior.) I mean... we could try to just make userspace page tables live in non-high-memory based on the assumption that nobody nowadays is going to run a 32-bit kernel on a device with significantly more than 4G or 8G or so of RAM, and probably most memory is used for stuff like anon/file pages, not for page tables... but I don't think it actually matters much for code complexity now that we need RCU for page table access anyway. > As for the RCU thing, I guess this is why pte_offset_map_lock() is taking > it? Maybe worth mentioning something there or updating that 'interestingly' > block... :>) Yeah, and not just pte_offset_map_lock() but also pte_offset_map() and such which don't lock the page table. > Or am I mistaken? I wasn't aware of this requirement, is this sort of > implied by the gup_fast() IRQ disabling stuff? This is to protect against khugepaged/MADV_COLLAPSE deleting page tables while holding only an rmap lock (in read mode), the pmd lock, and the pte lock. So a concurrent call to MADV_COLLAPSE (potentially in another process) can (via retract_page_tables()) remove the page table while we're traversing it, but the page table will only actually be freed via RCU (thanks to pte_free_defer), and so this doesn't cause use-after-free. pte_offset_map_lock() ensures that on success, it has locked the current page table (holding the pte lock prevents the page table being removed in the future, and rechecking the pmd entry ensures it hasn't already been removed), so we never end up installing PTEs into page tables that have already been removed. pte_offset_map() does not ensure that you always see the latest page table; it could be that by the time you're looking at a PTE inside it, the page table you're looking at has been removed, and a new page table has been installed in its place and populated with PTEs. But removed page tables are always empty, so the only possible consequence of such a race is that you wrongly think that there is no PTE at a given address. The fourth paragraph of the comment block above __pte_offset_map_lock() sorta explains this. > Please expand :)