On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote: > I would feel more comfortable if you provide patches for uffd-wp. If you > want, I will do it, but I restate that I do not feel comfortable with this > solution (worried as it seems a bit ad-hoc and might leave out a scenario > we all missed or cause a TLB shootdown storm). > > As for soft-dirty, I thought that you said that you do not see a better > (backportable) solution for soft-dirty. Correct me if I am wrong. I think they should use the same technique, since they deal with the exact same challenge. I will try to cleanup the patch in the meantime. I can also try to do the additional cleanups to clear_refs to eliminate the tlb_gather completely since it doesn't gather any page and it has no point in using it. > Anyhow, I will add your comments regarding the stale TLB window to make the > description clearer. Having the mmap_write_lock solution as backup won't hurt, but I think it's only for planB if planA doesn't work and the only stable tree that will have to apply this is v5.9.x. All previous don't need any change in this respect. So there's no worry of rejects. It worked by luck until Aug 2020, but it did so reliably or somebody would have noticed already. And it's not exploitable either, it just works stable, but it was prone to break if the kernel changed in some other way, and it eventually changed in Aug 2020 when an unrelated patch happened to the reuse logic. If you want to maintain the mmap_write_lock patch if you could drop the preserved_write and adjust the Fixes to target Aug 2020 it'd be ideal. The uffd-wp needs a different optimization that maybe Peter is already working on or I can include in the patchset for this, but definitely in a separate commit because it's orthogonal. It's great you noticed the W->RO transition of un-wprotect so we can optimize that too (it will have a positive runtime effect, it's not just theoretical since it's normal to unwrprotect a huge range once the postcopy snapshotting of the virtual machine is complete), I was thinking at the previous case discussed in the other thread. I just don't like to slow down a feature required in the future for implementing postcopy live snapshotting or other snapshots to userland processes (for the non-KVM case, also unprivileged by default if using bounce buffers to feed the syscalls) that can be used by open source hypervisors to beat proprietary hypervisors like vmware. The security concern of uffd-wp that allows to enlarge the window of use-after-free kernel bugs, is not as a concern as it is for regular processes. First the jailer model can obtain the uffd before dropping all caps and before firing up seccomp in the child, so it won't even require to lift the unprivileged_userfaultfd in the superior and cleaner monolithic jailer model. If the uffd and uffd-wp can only run in rust-vmm and qemu, that userland is system software to be trusted as the kernel from the guest point of view. It's similar to fuse, if somebody gets into the fuse process it can also stop the kernel initiated faults. From that respect fuse is also system software despite it runs in userland. In other words I think if there's a vm-escape that takes control of rust-vmm userland, the last worry is the fact it can stop kernel initiated page faults because the jailer took an uffd before drop privs. Thanks, Andrea