On Thu, Feb 2, 2023 at 1:56 AM David Stevens <stevensd@xxxxxxxxxxxx> wrote: > > On Thu, Feb 2, 2023 at 5:52 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > On Wed, Feb 01, 2023 at 09:36:37AM -0800, Yang Shi wrote: > > > On Tue, Jan 31, 2023 at 7:42 PM David Stevens <stevensd@xxxxxxxxxxxx> wrote: > > > > > > > > From: David Stevens <stevensd@xxxxxxxxxxxx> > > > > > > > > Collapsing memory in a vma that has an armed userfaultfd results in > > > > zero-filling any missing pages, which breaks user-space paging for those > > > > filled pages. Avoid khugepage bypassing userfaultfd by not collapsing > > > > pages in shmem reached via scanning a vma with an armed userfaultfd if > > > > doing so would zero-fill any pages. > > > > > > > > Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx> > > > > --- > > > > mm/khugepaged.c | 35 ++++++++++++++++++++++++----------- > > > > 1 file changed, 24 insertions(+), 11 deletions(-) > > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > > index 79be13133322..48e944fb8972 100644 > > > > --- a/mm/khugepaged.c > > > > +++ b/mm/khugepaged.c > > > > @@ -1736,8 +1736,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, > > > > * + restore gaps in the page cache; > > > > * + unlock and free huge page; > > > > */ > > > > -static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > > - struct file *file, pgoff_t start, > > > > +static int collapse_file(struct mm_struct *mm, struct vm_area_struct *vma, > > > > + unsigned long addr, struct file *file, pgoff_t start, > > > > struct collapse_control *cc) > > > > { > > > > struct address_space *mapping = file->f_mapping; > > > > @@ -1784,6 +1784,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, > > > > * be able to map it or use it in another way until we unlock it. > > > > */ > > > > > > > > + if (is_shmem) > > > > + mmap_read_lock(mm); > > > > > > If you release mmap_lock before then reacquire it here, the vma is not > > > trusted anymore. It is not safe to use the vma anymore. > > > > > > Since you already read uffd_was_armed before releasing mmap_lock, so > > > you could pass it directly to collapse_file w/o dereferencing vma > > > again. The problem may be false positive (not userfaultfd armed > > > anymore), but it should be fine. Khugepaged could collapse this area > > > in the next round. > > I didn't notice this race condition. It should be possible to adapt > hugepage_vma_revalidate for this situation, or at least to create an > analogous situation. But once you release mmap_lock, the vma still may be changed, revalidation just can guarantee the vma is valid when you hold the mmap_lock unless mmap_lock is held for the whole collapse or at some point that the collapse doesn't have impact on userfaultfd anymore. We definitely don't want to hold mmap_lock for the whole collapse, but I don't know whether we could release it earlier or not due to my limited knowledge of userfaultfd. > > > Unfortunately that may not be enough.. because it's also possible that it > > reads uffd_armed==false, released mmap_sem, passed it over to the scanner, > > but then when scanning the file uffd got armed in parallel. > > > > There's another problem where the current vma may not have uffd armed, > > khugepaged may think it has nothing to do with uffd and moved on with > > collapsing, but actually it's armed in another vma of either the current mm > > or just another mm's. > > > > It seems non-trivial too to safely check this across all the vmas, let's > > say, by a reverse walk - the only safe way is to walk all the vmas and take > > the write lock for every mm, but that's not only too heavy but also merely > > impossible to always make it right because of deadlock issues and on the > > order of mmap write lock to take.. > > > > So far what I can still think of is, if we can extend shmem_inode_info and > > have a counter showing how many uffd has been armed. It can be a generic > > counter too (e.g. shmem_inode_info.collapse_guard_counter) just to avoid > > the page cache being collapsed under the hood, but I am also not aware of > > whether it can be reused by other things besides uffd. > > > > Then when we do the real collapsing, say, when: > > > > xas_set_order(&xas, start, HPAGE_PMD_ORDER); > > xas_store(&xas, hpage); > > xas_unlock_irq(&xas); > > > > We may need to make sure that counter keeps static (probably by holding > > some locks during the process) and we only do that last phase collapse if > > counter==0. > > > > Similar checks in this patch can still be done, but that'll only service as > > a role of failing faster before the ultimate check on the uffd_armed > > counter. Otherwise I just don't quickly see how to avoid race conditions. > > I don't know if it's necessary to go that far. Userfaultfd plus shmem > is inherently brittle. It's possible for userspace to bypass > userfaultfd on a shmem mapping by accessing the shmem through a > different mapping or simply by using the write syscall. It might be > sufficient to say that the kernel won't directly bypass a VMA's > userfaultfd to collapse the underlying shmem's pages. Although on the > other hand, I guess it's not great for the presence of an unused shmem > mapping lying around to cause khugepaged to have user-visible side > effects. > > -David > > > It'll be great if someone can come up with something better than above.. > > Copy Hugh too. > > > > Thanks, > > > > -- > > Peter Xu > >