On Fri, Nov 15, 2024 at 08:04:52PM +0100, Jann Horn wrote: > On Fri, Nov 15, 2024 at 7:02 PM Lorenzo Stoakes > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > On Thu, Nov 14, 2024 at 10:12:00PM +0100, Jann Horn wrote: > > > Make it clearer that holding the mmap lock in read mode is not enough > > > to traverse page tables, and that just having a stable VMA is not enough > > > to read PTEs. > > > > > > Suggested-by: Matteo Rizzo <matteorizzo@xxxxxxxxxx> > > > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx> > > > > Have some queries before we move forward so would like a little more > > clarification/perhaps putting some extra meat on the bones first. > > > > Broadly very glad you have done this however so it's just sorting details > > first! :>) > > > > > --- > > > @akpm: Please don't put this in your tree before Lorenzo has replied. > > > > > > @Lorenzo: > > > This is intended to go on top of your documentation patch. > > > If you think this is a sensible change, do you prefer to squash it into > > > your patch or do you prefer having akpm take this as a separate patch? > > > IDK what works better... > > > > I think a new patch is better, as I'd like the original to settle down now > > and the whole point of this doc is that it's a living thing that many > > people can contribute to, update, etc. > > > > For instance, Suren is updating as part of one of his series to correct > > things that he changes in that series, which is really nice. > > > > > --- > > > Documentation/mm/process_addrs.rst | 21 +++++++++++++++++++-- > > > 1 file changed, 19 insertions(+), 2 deletions(-) > > > > > > diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst > > > index 1bf7ad010fc063d003bb857bb3b695a3eafa0b55..9bdf073d0c3ebea1707812508a309aa4a6163660 100644 > > > --- a/Documentation/mm/process_addrs.rst > > > +++ b/Documentation/mm/process_addrs.rst > > > @@ -339,6 +339,16 @@ When **installing** page table entries, the mmap or VMA lock must be held to > > > keep the VMA stable. We explore why this is in the page table locking details > > > section below. > > > > > > +.. warning:: Taking the mmap lock in read mode **is not sufficient** for > > > + traversing page tables; you must also ensure that a VMA exists that > > > + covers the range being accessed. > > > > Hm, but we say later we don't need _any_ locks for traversal, and here we > > say we need mmap read lock. Do you mean installing page table entries? > > > > Or do you mean to say, that if you don't span a VMA, you must acquire a > > write lock at least to preclude this? > > Yeah, this is what I meant with the "you must also ensure that a VMA > exists" part. (Context is that I was looking at some non-upstream code > that was trying to do exactly this - take a userspace-supplied address > and walk down the page tables at that address. Upstream also once had > a UAF in pagemap_walk() because walk_page_range() was used while > holding only the mmap read lock, see > <https://project-zero.issues.chromium.org/42451485>.) > > > This seems quite unclear. > > > > I kind of didn't want to touch on the horrors of fiddling about without a > > VMA, so I'd rather this very clearly say something like 'it is unusual to > > manipulate page tables wihch are not spanned by a VMA, and there are > > special requirements for this operation' etc. et.c otherwise this just adds > > more noise and confusion I think. > > I guess maybe we could replace this entire warning block with > something like this? > > ".. warning:: Page tables are normally only traversed in regions > covered by VMAs. If you want to traverse page tables in areas that > might not be covered by VMAs, heavier locking is required. See > :c:func:`!walk_page_range_novma` for details." Yes I prefer this. > > And walk_page_range_novma() already has a comment block talking about > why an mmap read lock isn't enough to traverse user virtual address > space outside VMAs. > > > > + This ensures you can't race with concurrent page table removal > > > + which happens with the mmap lock in read mode, in regions whose > > > + VMAs are no longer present in the VMA tree. > > > + > > > + (Alternatively, the mmap lock can be taken in write mode, but that > > > + is heavy-handed and almost never the right choice.) > > > > You kind of need to expand on why that is I think! > > > > > + > > > **Freeing** page tables is an entirely internal memory management operation and > > > has special requirements (see the page freeing section below for more details). > > > > > > @@ -450,6 +460,9 @@ the time of writing of this document. > > > Locking Implementation Details > > > ------------------------------ > > > > > > +.. warning:: Locking rules for PTE-level page tables are very different from > > > + locking rules for page tables at other levels. > > > + > > > Page table locking details > > > -------------------------- > > > > > > @@ -470,8 +483,12 @@ additional locks dedicated to page tables: > > > These locks represent the minimum required to interact with each page table > > > level, but there are further requirements. > > > > > > -Importantly, note that on a **traversal** of page tables, no such locks are > > > -taken. Whether care is taken on reading the page table entries depends on the > > > +Importantly, note that on a **traversal** of page tables, sometimes no such > > > +locks are taken. However, at the PTE level, at least concurrent page table > > > +deletion must be prevented (using RCU) and the page table must be mapped into > > > +high memory, see below. > > > > Ugh I really do hate that we have to think about high memory. I'd like to > > sort of deny it exists. But I suppose that's not an option. > > (FWIW from a quick look I think only arm and x86 actually have this behavior.) > > I mean... we could try to just make userspace page tables live in > non-high-memory based on the assumption that nobody nowadays is going > to run a 32-bit kernel on a device with significantly more than 4G or > 8G or so of RAM, and probably most memory is used for stuff like > anon/file pages, not for page tables... but I don't think it actually > matters much for code complexity now that we need RCU for page table > access anyway. Right, I just am not a fan of 32-bit kernels :P not your fault! > > > As for the RCU thing, I guess this is why pte_offset_map_lock() is taking > > it? Maybe worth mentioning something there or updating that 'interestingly' > > block... :>) > > Yeah, and not just pte_offset_map_lock() but also pte_offset_map() and > such which don't lock the page table. Could you update that 'interestingly' block then also? > > > Or am I mistaken? I wasn't aware of this requirement, is this sort of > > implied by the gup_fast() IRQ disabling stuff? > > This is to protect against khugepaged/MADV_COLLAPSE deleting page > tables while holding only an rmap lock (in read mode), the pmd lock, > and the pte lock. So a concurrent call to MADV_COLLAPSE (potentially > in another process) can (via retract_page_tables()) remove the page > table while we're traversing it, but the page table will only actually > be freed via RCU (thanks to pte_free_defer), and so this doesn't cause > use-after-free. pte_offset_map_lock() ensures that on success, it has > locked the current page table (holding the pte lock prevents the page > table being removed in the future, and rechecking the pmd entry > ensures it hasn't already been removed), so we never end up installing > PTEs into page tables that have already been removed. pte_offset_map() > does not ensure that you always see the latest page table; it could be > that by the time you're looking at a PTE inside it, the page table > you're looking at has been removed, and a new page table has been > installed in its place and populated with PTEs. But removed page > tables are always empty, so the only possible consequence of such a > race is that you wrongly think that there is no PTE at a given > address. It'd be good to include this detail in the doc also. > > The fourth paragraph of the comment block above > __pte_offset_map_lock() sorta explains this. > > > Please expand :) Thanks!