On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote: > > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > > > > > vmsplice syscall API is insecure allowing long term GUP PINs without ^^^^^^^^^ > > > > privilege. > > > > > > Lots of places are relying on pin_user_pages long term pins of memory, > > > and cannot be converted to notifiers. > > > > > > I don't think it is reasonable to just declare that insecure and > > > requires privileges, it is a huge ABI break. > > > > Where's that ABI? Are there specs or a code example in kernel besides > > vmsplice itself? > > If I understand you right, you are trying to say that the 193 > pin_user_pages() callers cannot exist as unpriv any more? 193, 1k 1m or their number in general, won't just make them safe... > The majority cannot be converted to notifiers because they are DMA > based. Every one of those is an ABI for something, and does not expect > extra privilege to function. It would be a major breaking change to > have pin_user_pages require some cap. ... what makes them safe is to be transient GUP pin and not long term. Please note the "long term" in the underlined line. O_DIRECT is perfectly ok to be unprivileged obviously. The VM can wait, eventually it goes away. Even a swapout is not an instant event and can be hold off by any number of other things besides a transient GUP pin. It can be hold off by PG_lock just to make an example. mlock however is long term, persistent, vmsplice takes persistent and can pin way too much memory for each mm, that doesn't feel safe. The more places doing stuff like that, the more likely one causes a safety issue, not the other way around it in fact. > > The whole zygote issue wouldn't even register if the child had the > > exact same credentials of the parent. Problem is the child dropped > > privileges and went with a luser id, that clearly cannot ptrace the > > parent, and so if long term unprivileged GUP pins are gone from the > > equation, what remains that the child can do is purely theoretical > > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. > > Sorry, I'm not sure I've found a good explanation how ptrace and GUP > are interacting to become a security problem. ptrace is not involved. What I meant by mentioning ptrace, is that if the child can ptrace the parent, then it doesn't matter if it can also do the below, so the security concern is zero in such case. With O_DIRECT or any transient pin you will never munmap while O_DIRECT is in flight, if you munmap it's undefined what happens in such case anyway. It is a theoretical security issue made practical by vmsplice API that allows to enlarge the window to years of time (not guaranteed milliseconds), to wait for the parent to trigger the wp_page_reuse. Remove vmsplice and the security issue in theory remains, but removed vmsplice it becomes irrelevant statistically speaking in practice. io_uring has similar concern but it can use mmu notifier, so it can totally fix it and be 100% safe from this. The scheduler disclosure date was 2020-08-25 so I can freely explain the case that motivated all these changes. case A) if !fork() { // in child mmap one page vmsplice takes gup pin long term on such page munmap one page // mapcount == 1 (parent mm) // page_count == 2 (gup in child, and parent mm) } else { parent writes to the page // mapcount == 1, wp_page_reuse } parent did a COW with mapcount == 1 so the parent will take over a page that is still GUP pinned in the child. That's the security issue because in this case the GUP pin is malicious. Now imagine this case B) mmap one page RDMA or any secondary MMU takes a long term GUP pin munmap one page // mapcount == 1 (parent mm) // page_count == 2 (gup in RDMA, and parent mm) How does the VM can tell between the two different cases? It can't. The current page_count in do_wp_page treats both cases the same and because page_count is 2 in both cases, it calls wp_page_copy in both cases breaking-COW in both cases. However, you know full well in the second case it is a feature and not a bug, that wp_page_reuse is called instead, and in fact it has to be called or it's a bug (and that's the bug page_count in do_wp_page introduces). So page_count in do_wp_page is breaking all valid users, to take care of the purely theoretical security issue that isn't a practical concern if only vmsplice is secured at least as good as mlock. page_count in do_wp_page is fundamentally flawed for all long term GUP pin done by secondary MMUs attached to the memory. The fix in 17839856fd588f4ab6b789f482ed3ffd7c403e1f had to work by triggering a GUP(write=1), that would break-COW while vmsplice runs, in turn fully resolving the security concern, but without breaking your very important case B. > 17839 makes sense to me, and read-only GUP has been avoided by places > like RDMA and others for a very long time because of these issues, > adding the same idea to the core code looks OK. Yes I acked 17839856fd588f4ab6b789f482ed3ffd7c403e1f since it looked the cleanest solution to take care of the purely theoretical security issue (purely theoretical after vmsplice is taken care of). I planned today to look what didn't work exactly in 17839856fd588f4ab6b789f482ed3ffd7c403e1f that may have required to move to 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, it was an huge email thread and I was too busy with urgent work at the time. > The semantics we discussed during the COW on fork thread for pin user > pages were, more or less, that once pinned a page should not be > silently removed from the mm it is currently in by COW or otherwise in > the kernel. I don't get what you mean here. Could you elaborate? > So maybe ptrace should not be COW'ing pinned pages at all, as that is > exactly the same kind of silent corruption fork was causing. ptrace isn't involved, details above. Could you elaborate also if fork started corrupting with 17839856fd588f4ab6b789f482ed3ffd7c403e1f applied? In which commit exactly the corruption started. In general fork(), unless you copy all GUP pinned pages and you don't wrprotect them in fork(), must be handled by blocking all writes on the RDMA region in the parent, then you fork, only after child did the exec you're allowed to unblock the writes in the parent that holds the GUP long term pins. I don't see a notable difference from page_count or mapcount in do_wp_page in this respect: only copying in fork() if the page is pinned like I was also proposed here https://lkml.kernel.org/r/20090311165833.GI27823@random.random will also prevent having to block the writes until exec is run though. FWIW I obviously agree in copying in fork any pinned page, but that was supposed to be an orthogonal improvement and it wasn't supposed to fix a recent regression (the fork vs thread vs gup race always existed, and the need of stopping writes in between fork and exec also). Thanks, Andrea