Re: getting rid of the last memory modifications through gup(FOLL_GET)

On 08.09.23 10:15, Christoph Hellwig wrote:
On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
and iov_iter_get_pages_alloc2.  We have three file system direct I/O
users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
to convert fuse to iov_iter_extract_pages which I'd love to see merged,
and we'd need equivalent work for ceph and nfs.

The non-file system uses are in the vmsplice code, which only reads

vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good;
I recall that David Howells had patches for that at one point. (at least to
use FOLL_PIN)

Hmm, unless I'm misreading the code, vmsplice is only using
iov_iter_get_pages2 for reading from the user address space anyway.
Or am I missing something?

It's not relevant for the case you're describing here ("last memory modifications through gup(FOLL_GET)").

vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()

So it ends up calling get_user_pages_fast()

... and not using FOLL_PIN|FOLL_LONGTERM

Why FOLL_LONGTERM? Because it's a longterm pin, where unprivileged users can grab a reference on a page for all eternity, breaking CMA and memory hotunplug (well, and harming compaction).

Why FOLL_PIN? Well, FOLL_LONGTERM only applies to FOLL_PIN. But for anonymous memory, this will also take care of the last remaining hugetlb COW test (trigger COW unsharing) as commented back in:

https://lore.kernel.org/all/02063032-61e7-e1e5-cd51-a50337405159@xxxxxxxxxx/
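
To make the vmsplice case above concrete from user space, here is a minimal sketch. The helper name vmsplice_then_modify() is made up for illustration and is not taken from any patch. It vmsplices one anonymous page into a pipe, overwrites the page afterwards, and reads the pipe back. On kernels where the pipe buffer references the user page rather than copying it, the reader typically sees the later contents, which shows that the pipe keeps holding the page for as long as the data sits there unread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for vmsplice() in <fcntl.h> */
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): splice up to one page of
 * anonymous memory into a fresh pipe, modify the memory afterwards,
 * then read the pipe back into out[0..len).
 * Returns the number of bytes read from the pipe, or -1 on error.
 */
static ssize_t vmsplice_then_modify(char *out, size_t len)
{
    int pfd[2];
    ssize_t ret = -1;
    struct iovec iov;
    char *page;
    long psz = sysconf(_SC_PAGESIZE);

    if (psz <= 0 || len == 0 || len > (size_t)psz)
        return -1;

    page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    if (pipe(pfd) < 0)
        goto out_unmap;

    memset(page, 'A', len);
    iov.iov_base = page;
    iov.iov_len = len;

    /* The kernel takes a reference on the user page here. */
    if (vmsplice(pfd[1], &iov, 1, 0) != (ssize_t)len)
        goto out_close;

    /* Modify the page while the pipe still refers to it. */
    memset(page, 'B', len);

    ret = read(pfd[0], out, len);

out_close:
    close(pfd[0]);
    close(pfd[1]);
out_unmap:
    munmap(page, (size_t)psz);
    return ret;
}
```

Calling vmsplice_then_modify(buf, 64) and inspecting buf[0] shows whether the reader observed the data from before or after the modification. An unprivileged process that never reads the pipe keeps that page reference around indefinitely, which is the longterm behavior described above.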



After that we might have to do an audit of the raw get_user_pages APIs,
but there probably aren't many that modify file backed memory.

ptrace should apply here: it ends up doing a FOLL_GET|FOLL_WRITE.

Yes, if that ends up on file backed shared mappings we also need a pin.

See below.


Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level
page tables for VMs, and uses MMU notifiers to synchronize the second-level
page tables with process page table changes. So once a PTE goes from
writable -> r/o in the process page table, the second level page tables for
the VM will get updated. Such MMU users are quite different from ordinary
GUP users.

Can KVM page tables use file backed shared mappings?

Yes, usually shmem and hugetlb. But with things like emulated NVDIMMs/virtio-pmem for VMs, easily also ordinary files.

But it's really not ordinary write access through GUP. It's write access via a secondary page table (secondary MMU) that's synchronized to the process page table -- just as if the CPU were writing to the page using the process page tables (primary MMU).


Converting ptrace might not be desired/required as well (the reference is
dropped immediately after the read/write access).

But the pin is needed to make sure the file system can account for
dirtying the pages.  Something we fundamentally can't do with get.

ptrace will find the pagecache page writable in the page table (PTE write bit set), if it intends to write to the page (FOLL_WRITE). If it is not writable, it will trigger a page fault that informs the file system.

With an FS that wants writenotify, we will not map a page writable (PTE write bit not set) unless it is dirty (PTE dirty bit set) IIRC.

So are we concerned about a race between the filesystem removing the PTE write bit (to catch next write access before it gets dirtied again) and ptrace marking the page dirty?

It's a very, very small race window, staring at __access_remote_vm(). But it should apply if that's the concern.
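
To spell out the ordering I have in mind, here is a toy model. It is not kernel code: all toy_* types and functions are made up for illustration, and it only encodes the rules described above (FOLL_WRITE needs the PTE write bit, a write fault informs the file system, and the file system write-protects the PTE when it cleans the page).

```c
#include <stdbool.h>

/* Toy state only; these are not kernel structures. */
struct toy_pte  { bool write; bool dirty; };
struct toy_page { bool dirty; bool fs_notified; };

/* GUP with FOLL_WRITE: take a write fault first if the PTE is read-only. */
static void toy_gup_write(struct toy_pte *pte, struct toy_page *page)
{
    if (!pte->write) {
        /* The write fault is what informs the file system. */
        page->fs_notified = true;
        pte->write = true;
        pte->dirty = true;
    }
}

/* The file system cleans the page and write-protects the PTE again. */
static void toy_fs_clean(struct toy_pte *pte, struct toy_page *page)
{
    pte->write = false;
    pte->dirty = false;
    page->dirty = false;
    page->fs_notified = false;
}

/* ptrace completes its access and marks the page dirty. */
static void toy_ptrace_finish(struct toy_page *page)
{
    page->dirty = true;
}

/*
 * Returns true if the page ends up dirty without the file system
 * having been informed, i.e. the race window was hit.
 */
static bool toy_race(bool fs_cleans_in_window)
{
    struct toy_pte pte = { .write = true, .dirty = true };
    struct toy_page page = { .dirty = true, .fs_notified = true };

    toy_gup_write(&pte, &page);       /* PTE already writable: no fault */
    if (fs_cleans_in_window)
        toy_fs_clean(&pte, &page);    /* lands between lookup and dirtying */
    toy_ptrace_finish(&page);

    return page.dirty && !page.fs_notified;
}
```

In this model toy_race(false) returns false and toy_race(true) returns true: the problem only shows up if the cleaning step lands between the page lookup and the dirtying step.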


The end goal as discussed a couple of times would be to limit FOLL_GET
in general only to a couple of users that can be audited and keep using it
for a good reason. Arbitrary drivers that perform DMA should stop using it
(and ideally be prevented from using it) and switch to FOLL_PIN.

Agreed, that's where I'd like to get to.  Preferably with the non-pin
API not even being exported to modules.

Yes. However, secondary MMU users (like KVM) would need some way to keep making use of that; ideally, using a proper separate interface instead of (ab)using plain GUP and confusing people :)

[1] https://lkml.org/lkml/2023/1/24/451

--
Cheers,

David / dhildenb



