On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > vmsplice syscall API is insecure allowing long term GUP PINs without > > privilege. > > Lots of places are relying on pin_user_pages long term pins of memory, > and cannot be converted to notifiers. > > I don't think it is reasonable to just declare that insecure and > requires privileges, it is a huge ABI break. Where's that ABI? Are there specs or a code example in kernel besides vmsplice itself? I don't see how it's possible to consider long term GUP pins completely unprivileged if not using mmu notifier. vmsplice doesn't even account them in rlimit (it cannot because it cannot identify all put_pages either). Long term GUP pins not using mmu notifier and not accounted in rlimit are an order of magnitude more VM-intrusive than mlock. The reason it's worse than mlock, even if ignore all performance feature that they break including numa bindings and that mlock doesn't risk to break, come because you can unmap the memory after taking those rlimit unaccounted GUP pins. So the OOM killer won't even have a chance to see the GUP pins coming. So it can't be that mlock has to be privileged but unconstrainted unaccounted long term GUP pins as in vmsplice are ok to stay unprivileged. Now io_uring does account the GPU pins in the mlock rlimit, but after the vma is unmapped it'd still cause the same confusion to OOM killer and in addition the assumption that each GUP pin cost 4k is also flawed. However io_uring model can use the mmu notifier without slowdown to the fast paths, so it's not going to cause any ABI break to fix it. Or to see it another way, it'd be fine to declare all mlock rlimits are obsolete and memcg is the only way to constrain RAM usage, but then mlock should stop being privileged, because mlock is a lesser concern and it won't risk to confuse the OOM killer at least. The good thing is the GUP pins won't escape memcg accounting but that accounting also doesn't come entirely free. > FWIW, vhost tries to use notifiers as a replacement for GUP, and I > think it ended up quite strange and complicated. It is hard to > maintain performance when every access to the pages needs to hold some > protection against parallel invalidation. And that's fine, this is all about if it should require a one liner change to add the username in the realtime group in /etc/group or not. You're focusing on your use case, but we've to put things in prospective of all these changes started. The whole zygote issue wouldn't even register if the child had the exact same credentials of the parent. Problem is the child dropped privileges and went with a luser id, that clearly cannot ptrace the parent, and so if long term unprivileged GUP pins are gone from the equation, what remains that the child can do is purely theoretical even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. NOTE: I'm all for fixing the COW for good, but vmsplice or any long term GUP pin that is absolutely required to make such attack practical, looks the real low hanging fruit here to fix. However fixing it so clear_refs becomes fundamentally incompatible with mmu notifier users unless they all convert to pure !FOLL_GET GUPs, let alone long term GUP pins not using mmu notifier, doesn't look great. For vmsplice that new break-COW is the fix because it happens in the other process. For every legit long term GUP, where the break-COW happens in the single and only process, it's silent MM corruption. Thanks, Andrea