On Mon, Feb 26, 2024 at 10:28:11AM +0100, David Hildenbrand wrote: > On 23.02.24 01:35, Matthew Wilcox wrote: > > On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: > > > > This creates the situation where access to successfully mmap()'d > > > > memory might SIGBUS at page fault. There is precedence for > > > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > > > the hugetlbfs cgroups controller, which could SIGBUS at page > > > > fault time depending on the accounting limit. > > > > > > I added a test: folio_mmapped() [1] which checks if there's a vma > > > covering the corresponding offset into the guest_memfd. I use this > > > test before trying to make page private to guest and I've been able to > > > ensure that userspace can't even mmap() private guest memory. If I try > > > to make memory private, I can test that it's not mmapped and not allow > > > memory to become private. In my testing so far, this is enough to > > > prevent SIGBUS from happening. > > > > > > This test probably should be moved outside Gunyah specific code, and was > > > looking for maintainer to suggest the right home for it :) > > > > > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@xxxxxxxxxxx/ > > > > You, um, might have wanted to send an email to linux-mm, not bury it in > > the middle of a series of 35 patches? > > > > So this isn't folio_mapped() because you're interested if anyone _could_ > > fault this page, not whether the folio is currently present in anyone's > > page tables. > > > > It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure > > it's worth the effort to use walk_page_mapping(), but I would defer to > > David. > > First, I suspect we are not only concerned about current+future VMAs > covering the page, we are also interested in any page pins that could have > been derived from such a VMA? > > Imagine user space mmap'ed the file, faulted in page, took a pin on the page > using pin_user_pages() and friends, and then munmap()'ed the VMA. > > You likely want to catch that as well and not allow a conversion to private? > > [I assume you want to convert the page to private only if you hold all the > folio references -- i.e., if the refcount of a small folio is 1] > Ah, this was something I hadn't thought about. I think both Fuad and I need to update our series to check the refcount rather than mapcount (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > Now, regarding the original question (disallow mapping the page), I see the > following approaches: > > 1) SIGBUS during page fault. There are other cases that can trigger > SIGBUS during page faults: hugetlb when we are out of free hugetlb > pages, userfaultfd with UFFD_FEATURE_SIGBUS. > > -> Simple and should get the job done. > > 2) folio_mmapped() + preventing new mmaps covering that folio > > -> More complicated, requires an rmap walk on every conversion. > > 3) Disallow any mmaps of the file while any page is private > > -> Likely not what you want. > > > Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are > you trying to avoid page faults? What's the use case? > We were chatting whether we could do better than the SIGBUS approach. SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to return errors early. One difference between hugetlb and this usecase is that running out of free hugetlb pages isn't something we could detect at mmap time. In guest_memfd usecase, we should be able to detect when SIGBUS becomes possible due to memory being lent to guest. I can't think of a reason why userspace would want/be able to resume operation after trying to access a page that it shouldn't be allowed, so SIGBUS is functional. The advantage of trying to avoid SIGBUS was better/easier reporting to userspace. - Elliot