Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

Dan Williams <dan.j.williams@xxxxxxxxx> · Tue, 23 Feb 2021 17:32:16 -0800

On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > > path. It should be measurable that this change is at least as fast or
> > > > > > faster than falling back to the slow path, but it would be good to
> > > > > > measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device will
> > > > suspend removal awaiting all page references to be dropped, but
> > > > gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming it
> > > > can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW page
> > > table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff, why
> > > does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to "say wait, nevermind this isn't a
> > page allocator page fallback to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true, and an original dax bug! Unbinding a
block-device from underneath the filesystem does trigger the
filesystem to emergency shutdown / go read-only. But unless that
process also includes a global zap of all dax mappings, it leaves
page tables pointing at disappearing memory, and it also breaks any
guarantee the filesystem may want that no further dax writes can
happen after shutdown. Right now I believe the filesystem only
assumes that mmap I/O will come from page writeback, so there's no
need to bother applications with mappings to page cache, but dax
mappings need to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.
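
For anyone following along, here is a rough userspace model of the
check discussed above: gup-fast takes a live reference on the
dev_pagemap before it pins the page, and unbind marks the pagemap
dying and waits for references to drain. This is only a sketch under
stated assumptions, not the kernel code. Every name in it (toy_pgmap,
toy_page, toy_gup_fast_pin, and so on) is hypothetical, and C11
atomics stand in for the kernel's reference counting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dev_pagemap; not a kernel structure. */
struct toy_pgmap {
	atomic_bool dying;	/* set once unbind has started */
	atomic_long refs;	/* transient references held by pinners */
};

/* Hypothetical stand-in for a device-backed struct page. */
struct toy_page {
	struct toy_pgmap *pgmap;
	atomic_long refcount;	/* long-lived pins, e.g. direct I/O */
};

static void toy_init(struct toy_pgmap *pgmap, struct toy_page *page)
{
	atomic_init(&pgmap->dying, false);
	atomic_init(&pgmap->refs, 0);
	page->pgmap = pgmap;
	atomic_init(&page->refcount, 0);
}

/*
 * Take the reference first, then check for dying. With sequentially
 * consistent atomics, either this thread observes dying, or the
 * unbinding thread observes refs > 0 and has to wait.
 */
static bool toy_pgmap_tryget_live(struct toy_pgmap *pgmap)
{
	atomic_fetch_add(&pgmap->refs, 1);
	if (atomic_load(&pgmap->dying)) {
		atomic_fetch_sub(&pgmap->refs, 1);
		return false;
	}
	return true;
}

static void toy_pgmap_put(struct toy_pgmap *pgmap)
{
	atomic_fetch_sub(&pgmap->refs, 1);
}

/* Model of the fast path: refuse to pin once the device is going away. */
static bool toy_gup_fast_pin(struct toy_page *page)
{
	if (!toy_pgmap_tryget_live(page->pgmap))
		return false;	/* caller would fall back or fail */
	atomic_fetch_add(&page->refcount, 1);
	toy_pgmap_put(page->pgmap);
	return true;
}

static void toy_put_page(struct toy_page *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/* Model of the start of driver unbind: block new pins. */
static void toy_unbind_begin(struct toy_pgmap *pgmap)
{
	atomic_store(&pgmap->dying, true);
}

/* Removal may proceed only when no pinner or pinned page remains. */
static bool toy_unbind_may_remove(struct toy_pgmap *pgmap,
				  struct toy_page *page)
{
	return atomic_load(&pgmap->dying) &&
	       atomic_load(&pgmap->refs) == 0 &&
	       atomic_load(&page->refcount) == 0;
}
```

In this model a pin taken before toy_unbind_begin() holds off
removal until toy_put_page(), and a pin attempted afterwards fails
instead of touching a page that may be gone. It says nothing about
the page tables themselves, which is the part in question above.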