Hello everyone, On Fri, Jan 08, 2021 at 12:48:16PM +0000, Will Deacon wrote: > On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote: > > Please. Why is the correct patch not the attached one (apart from the > > obvious fact that I haven't tested it and maybe just screwed up > > completely - but you get the idea)? > > It certainly looks simple and correct to me, although it means we're now > taking the mmap sem for write in the case where we only want to clear the > access flag, which should be fine with the thing only held for read, no? I'm curious, would you also suggest that fixing just the TLB flushing symptom is enough and we can forget about the ABI break coming from page_count used in do_wp_page? One random example: clear_refs will still break all long term GUP pins, are you ok with that too? page_count in do_wp_page is a fix for the original security issue from vmsplice (where the child is fooling the parent in taking the exclusive page in do_wp_page), that appears worse than the bug itself. page_count in do_wp_page, instead of isolating as malicious when the parent is reusing the page queued in the vmsplice pipe, is treating as malicious also all legit cases that had to reliably reuse the page to avoid the secondary MMUs to go out of sync. page_count in do_wp_page is like a credit card provider blocking all credit cards of all customers, because one credit card may have been cloned (by vmsplice), but nobody can know which one was it. Of course this technique will work perfectly as security fix because it will treat all credit card users as malicious and it'll block them all ("block as in preventing re-use of the anon page"). The problem are those other credit card users that weren't malicious that get their COW broken too. Those are the very long term GUP pins if any anon page can be still wrprotected anywhere in the VM. At the same time the real hanging fruit (vmsplice) that, if taken care of, would have rendered the bug purely theoretical in security terms hasn't been fixed yet, despite those unprivileged long term GUP pins causes more reproducible security issues than just the COW, since they can still DoS the OOM killer and they bypass at least the mlock enforcement, even for non compound pages. Of course just fixing vmsplice to require some privilege won't fix the bug in full, so it's not suitable long term solution, but it has to happen orthogonality for other reason, and it'd at least remove the short term security concern. In addition you're not experiencing the full fallout of the side effects of page_count used to decide if to re-use all anon COW pages because the bug is still there (with enterprise default config options at least). Not all credit cards are blocked yet with only 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you will block them all, you will experience all the side effects of replacing the per-subpage finegrined mapcount with the compound-wide page count. The two statements above combined, result in my recommendation at this point to resolve this in userland by rendering the security issue theoretical by removing vmsplice from the OCI schema allowlist or by enforcing it fixing in userland by always using execve after drop privs (as crun always does when it starts the container of course). For the long term, I can't see how using page_count in do_wp_page is a tenable proposition, unless we either drop all secondary MMUs from the kernel or VM features like clear_refs are dropped or unless the page_count is magically stabilized and the speculative pagecache lookups are also dropped. If trying to manage the fallout by enforcing no anon page can ever be wrprotected in place (i.e. dropping clear_refs feature or rendering it unreliable by skipping elevated counts caused by spurious pagecache lookups), it'd still sounds a too fragile design and too prone to break to rely on that. There's random arch stuff even wrprotecting memory, even very vm86 does it under the hood (vm86 is unlikely it has a long term GUP pin on it of course, but still who knows?). I mean the VM core cannot make assumptions like: "this vm86 case can still wrprotect without worry because probably vm86 isn't used anymore with any advanced secondary MMU, so if there's a GUP pin it probably is a malicious vmsplice and not a RDMA or GPU or Virtualization secondary MMU". Then there's the secondary concern of the inefficiency it introduces with extra unnecessary copies when a single GUP pin will prevent reuse of 512 or 262144 subpages, in the 512 case also potentially mapped in different processes. The TLB flushing discussions registers as the last concern in my view. Thanks, Andrea