On Fri, Jan 08, 2021 at 09:50:08PM -0500, Andrea Arcangeli wrote:

> For all those that aren't using mmu notifier and that rely solely on
> page pins, they still require privilege, except they do through /dev/
> permissions.

It is normal that the dev nodes are a+rw, so it doesn't really require
privilege in any real sense.

> Actually the mmu notifier doesn't strictly require pins, it only
> requires GUP. All users tend to use FOLL_GET just as a safety
> precaution (I already tried to optimize away the two atomics per GUP,
> but we were nacked by the KVM maintainer that didn't want to take the
> risk, I would have, but it's a fair point indeed, obviously it's safer
> with the pin plus the mmu notifier, two is safer than one).

I'm not sure what holding the pin will do to reduce risk?

If you get into a situation where you are stuffing a page into the
SMMU that is not in the CPU's MMU then everything is lost. Holding a
pin while carrying a page from the CPU page table to the SMMU just
ensures that page isn't freed until it is installed, but once
installed you are back to being broken.

> I'm not sure how any copy-user could obviate a secondary MMU mapping,
> mappings and copies are mutually exclusive. Any copy would be breaking
> memory coherency in this environment.

Because most places need to copy from user to stable kernel memory
before processing data under user control. You can't just cast a user
controlled pointer to a kstruct and use it - that is very likely a
security bug.

Still, the general version is something like kmap:

   map = user_map_setup(user_ptr, length)
   kptr = user_map_enter(map)
   [use kptr]
   user_map_leave(map, kptr)

And inside it could use mmu notifiers, or gup, or whatever.
user_map_setup() would register the notifier and user_map_enter()
would validate the cached page pointer and block invalidation until
user_map_leave().
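To make that shape concrete, here is a minimal sketch of how such an
API could sit on top of mmu notifiers. Everything here is
hypothetical - the user_map_* names are just the pseudocode above
fleshed out - it handles a single page, skips all error handling and
range checks, and glosses over the enter/invalidate race a real
version would have to close (e.g. with a notifier sequence count, the
way hmm_range_fault() users do):

   #include <linux/mmu_notifier.h>
   #include <linux/highmem.h>
   #include <linux/slab.h>
   #include <linux/mm.h>

   struct user_map {
           struct mmu_notifier mn;
           unsigned long user_ptr;
           struct page *page;              /* cached translation */
           spinlock_t lock;
           bool valid;
   };

   static int user_map_invalidate(struct mmu_notifier *mn,
                                  const struct mmu_notifier_range *range)
   {
           struct user_map *map = container_of(mn, struct user_map, mn);

           /*
            * A real version must also hold off completion here while
            * an enter/leave section is active, and should check that
            * the range actually overlaps user_ptr.
            */
           spin_lock(&map->lock);
           if (map->valid) {
                   put_page(map->page);    /* transient ref, not a pin */
                   map->valid = false;
           }
           spin_unlock(&map->lock);
           return 0;
   }

   static const struct mmu_notifier_ops user_map_ops = {
           .invalidate_range_start = user_map_invalidate,
   };

   struct user_map *user_map_setup(unsigned long user_ptr, size_t length)
   {
           struct user_map *map = kzalloc(sizeof(*map), GFP_KERNEL);

           map->user_ptr = user_ptr;
           spin_lock_init(&map->lock);
           map->mn.ops = &user_map_ops;
           mmu_notifier_register(&map->mn, current->mm);
           return map;
   }

   void *user_map_enter(struct user_map *map)
   {
           /* Revalidate the cached page if an invalidate raced in */
           if (!READ_ONCE(map->valid)) {
                   get_user_pages_fast(map->user_ptr, 1, FOLL_WRITE,
                                       &map->page);
                   WRITE_ONCE(map->valid, true);
           }
           return kmap(map->page);
   }

   void user_map_leave(struct user_map *map, void *kptr)
   {
           kunmap(map->page);
   }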
> The primary concern with the mmu notifier in io_uring is the
> take_all_locks latency.

Just enabling mmu_notifier takes a performance hit on the entire
process too, it is not such a simple decision.. We'd need benchmarks
against a database or scientific application to see how negative the
notifier actually becomes.

> The problem with the mmu notifier as an universal solution, for
> example is that it can't wait for I/O completion of O_DIRECT since it
> has no clue where the put_page is to wait for it, otherwise we could
> avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be
> completed before paging or anything can unmap the page under I/O from
> the pagetable.

GPU is already doing something like this, waiting in a notifier
invalidate callback for DMA to finish before allowing invalidate to
complete. It is horrendously complicated and I'm not sure blocking
invalidate for a long time is actually much better for the MM..

> I see the incompatibility you describe as a problem we have today, in
> the present, and one that will fade with time.
>
> Reminds me when we had >4G of RAM and 32bit devices doing DMA. How
> many 32bit devices are there now?

I'm not so sure anymore. A few years ago OpenCAPI and PCI PRI seemed
like good things, but now with experience they carry pretty bad
performance hits to use them. Lots of places are skipping them.

CXL offers another chance at this, so we'll see again in another 5
years or so if it works out. It is not an easy problem to solve from a
HW perspective.

> We're not talking here about any random PCI device, we're talking here
> about very special and very advanced devices that need to have "long
> term" GUP pins in order to operate, not the usual nvme/gigabit device
> where GUP pins are never long term.

Beyond RDMA, netdev's XDP uses FOLL_LONGTERM, so do various video
devices, and lots of things related to virtualization like vfio, vdpa
and vhost. I think it is a bit defeatist to say it doesn't matter. If
anything, as time goes on the list seems to be growing, not shrinking.

> The point is that if you do echo ... >/proc/self/clear_refs on your
> pid, that has any FOLL_LONGTERM on its mm, it'll just cause your
> device driver to go out of sync with the mm. It'll see the old pages,
> before the spurious COWs. The CPU will use new pages (the spurious
> COWs).

But if you do that then clear_refs isn't going to work the way it
thought either - this first needs some explanation of how clear_refs
is supposed to work when DMA WRITE is active on the page.

I'd certainly say causing a loss of synchrony is not acceptable, so if
we keep Linus's version of COW then clear_refs has to not write
protect pages under DMA (a rough sketch of such a check is at the end
of this mail).

> > secondary-mmu drivers using mmu notifier should not trigger this logic
> > and should not restrict write protect.
>
> That's a great point. I didn't think the mmu notifier will invalidate
> the secondary MMU and ultimately issue a GUP after the wp_copy_page to
> keep it in sync.

It had better, or mmu notifiers are broken, right?

> The funny thing that doesn't make sense is that wp_copy_page will only
> be invoked because the PIN was left by KVM on the page for that extra
> safety I was talking about earlier.

Yes, with the COW change, if kvm cares about this inefficiency it
should not hold the unnecessary pin.

> You clearly contemplate the existence of a read mode, long term. That
> is also completely compatible with wrprotection.

We talked about a read mode, but we didn't flesh it out. It is not
unconditionally compatible with wrprotect - most likely you still
can't write protect a page under READ DMA, because when you eventually
take the COW there will be ambiguous situations that will break the
synchrony.

Jason
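The clear_refs check referenced above could look something like this,
sketched against the soft-dirty helper in fs/proc/task_mmu.c. This is
only a sketch, and note that page_maybe_dma_pinned() can only detect
FOLL_PIN users - a page held with a plain FOLL_GET reference would
still get write protected:

   static inline void clear_soft_dirty(struct vm_area_struct *vma,
                                       unsigned long addr, pte_t *pte)
   {
           pte_t ptent = *pte;

           if (pte_present(ptent)) {
                   pte_t old_pte;

                   /*
                    * Skip pages that may be under DMA: wrprotecting
                    * them forces a COW on the next CPU write, which
                    * breaks the synchrony with the device's view of
                    * the memory.
                    */
                   if (page_maybe_dma_pinned(pte_page(ptent)))
                           return;

                   old_pte = ptep_modify_prot_start(vma, addr, pte);
                   ptent = pte_wrprotect(old_pte);
                   ptent = pte_clear_soft_dirty(ptent);
                   ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
           }
   }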