Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Thu, 7 Jan 2021 16:45:33 -0500

On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> 
> > vmsplice syscall API is insecure allowing long term GUP PINs without
> > privilege.
> 
> Lots of places are relying on pin_user_pages long term pins of memory,
> and cannot be converted to notifiers.
> 
> I don't think it is reasonable to just declare that insecure and
> requires privileges, it is a huge ABI break.

Where's that ABI? Are there specs or a code example in kernel besides
vmsplice itself?

I don't see how it's possible to consider long term GUP pins
completely unprivileged if not using mmu notifier. vmsplice doesn't
even account them in rlimit (it cannot because it cannot identify all
put_pages either).

Long term GUP pins not using mmu notifier and not accounted in rlimit
are an order of magnitude more VM-intrusive than mlock.

The reason it's worse than mlock, even if ignore all performance
feature that they break including numa bindings and that mlock doesn't
risk to break, come because you can unmap the memory after taking
those rlimit unaccounted GUP pins. So the OOM killer won't even have a
chance to see the GUP pins coming.

So it can't be that mlock has to be privileged but unconstrainted
unaccounted long term GUP pins as in vmsplice are ok to stay
unprivileged.

Now io_uring does account the GPU pins in the mlock rlimit, but after
the vma is unmapped it'd still cause the same confusion to OOM killer
and in addition the assumption that each GUP pin cost 4k is also
flawed. However io_uring model can use the mmu notifier without
slowdown to the fast paths, so it's not going to cause any ABI break
to fix it.

Or to see it another way, it'd be fine to declare all mlock rlimits
are obsolete and memcg is the only way to constrain RAM usage, but
then mlock should stop being privileged, because mlock is a lesser
concern and it won't risk to confuse the OOM killer at least.

The good thing is the GUP pins won't escape memcg accounting but that
accounting also doesn't come entirely free.

> FWIW, vhost tries to use notifiers as a replacement for GUP, and I
> think it ended up quite strange and complicated. It is hard to
> maintain performance when every access to the pages needs to hold some
> protection against parallel invalidation.

And that's fine, this is all about if it should require a one liner
change to add the username in the realtime group in /etc/group or not.

You're focusing on your use case, but we've to put things in
prospective of all these changes started.

The whole zygote issue wouldn't even register if the child had the
exact same credentials of the parent. Problem is the child dropped
privileges and went with a luser id, that clearly cannot ptrace the
parent, and so if long term unprivileged GUP pins are gone from the
equation, what remains that the child can do is purely theoretical
even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.

NOTE: I'm all for fixing the COW for good, but vmsplice or any long
term GUP pin that is absolutely required to make such attack
practical, looks the real low hanging fruit here to fix.

However fixing it so clear_refs becomes fundamentally incompatible
with mmu notifier users unless they all convert to pure !FOLL_GET
GUPs, let alone long term GUP pins not using mmu notifier, doesn't
look great. For vmsplice that new break-COW is the fix because it
happens in the other process.

For every legit long term GUP, where the break-COW happens in the
single and only process, it's silent MM corruption.

Thanks,
Andrea