Re: [PATCH 6.6] fork: defer linking file vma until vma is fully initialized

Axel Rasmussen <axelrasmussen@xxxxxxxxxx> · Tue, 16 Jul 2024 09:58:34 -0700

On Tue, Jul 16, 2024 at 9:08 AM Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
>
> On Mon, 15 Jul 2024 18:06:25 -0700
> Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
>
> > On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson
> > <alex.williamson@xxxxxxxxxx> wrote:
> > >
> > > On Mon, 15 Jul 2024 13:35:41 -0700
> > > Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> > >
> > > > I tried out Sasha's suggestion. Note that *just* taking
> > > > aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also
> > > > need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
> > > >
> > > > But, the good news is both of those apply more or less cleanly to 6.6. And, at
> > > > least under a very basic test which exercises VFIO memory mapping, things seem
> > > > to work properly with that change.
> > > >
> > > > I would agree with Leah that these seem a bit big to be stable fixes. But, I'm
> > > > encouraged by the fact that Sasha suggested taking them. If there are no big
> > > > objections (Alex? :) ) I can send the backport patches this week.
> > > >
> > >
> > > If you were to take those, I think you'd also want:
> > >
> > > d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
> > >
> > > which helps avoid a potential regression in VM startup latency vs
> > > faulting each page of the VMA.  Ideally we'd have had huge_fault
> > > working for pfnmaps before this conversion to avoid the latter commit.
> > >
> > > I'm a bit confused by the lineage here though, 35e351780fa9 ("fork:
> > > defer linking file vma until vma is fully initialized") entered v6.9
> > > whereas these vfio changes all came in v6.10, so why does the v6.6
> > > backport end up with dependencies on these newer commits?  Is there
> > > something that needs to be fixed in v6.9-stable as well?
> >
> > Right, I believe 35e351780fa9 introduced a bug for VFIO by calling
> > vm_ops->open() *before* copy_page_range(). So I think this bug affects
> > not just 6.6 (to which 35e351780fa9 was stable backported) but also
> > 6.9 as you say.
> >
> > The reason to bring up all these newer commits is, it's unclear how to
> > fix the bug. :) We thought we had a simple solution to just reorder
> > when vm_ops->open() is called, but Miaohe pointed out elsewhere in
> > this thread an issue with doing that.
> >
> > Assuming the reordering is unworkable, the only other idea I have for
> > fixing the bug without the larger refactor is:
> >
> > 1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after
> > vm_ops->open() is called
> > 2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a
> > not-fully-populated range (expected if we never copy_page_range!) we
> > don't get a warning
> >
> > There are downsides to this fix. It's kind of abusing VM_WIPEONFORK
> > for a new purpose. It's removing a warning which may catch other
> > legitimate problems. And it's diverging stable kernels from upstream
> > as Sasha points out.
> >
> > Just backporting the refactors fixes (well, totally avoids) the bug,
> > and it doesn't require special hackery only for stable kernels.
>
> Yes, I'd agree that we want to stay as close as possible to the current
> upstream solution, even if we got there pretty haphazardly.  Therefore
> it sounds like we should queue the following for v6.9-stable:
>
> d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
> aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
> b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
>
> And then anywhere that 35e351780fa9 ("fork: defer linking file vma
> until vma is fully initialized") gets backported, those will also need
> to follow.

Sounds good to me. I can send these patches for 6.9 and then 6.6.

>
> Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the
> previous v6.6 backport to use as a test case or do we just know it's an
> issue from inspection?  The revert only notes an xfstest issue.  Thanks,

I'm not aware of any reports of this, besides our own detection internally.

We originally noticed the failure mode via xfstests: we call
copy_page_range, and underneath untrack_pfn we find a 'hole' in the
mapping, so we WARN. A fair question is, why does running xfstests
involve exercising vfio-pci? :) Internally our test machines use
vfio-pci for other reasons; xfstests is an innocent bystander here. We
just happened to trigger this WARN while xfstests was running, so it
noticed and reported the WARN in the test results.

Since that repro is specific to our test machine setup, it
unfortunately isn't an easily shareable regression test. :/

>
> Alex
>