On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > > path. It should be measurable that this change is at least as fast or
> > > > > > faster than falling back to the slow path, but it would be good to
> > > > > > measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device will
> > > > suspend removal awaiting all page references to be dropped, but
> > > > gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming it
> > > > can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW page
> > > table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff, why
> > > does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to "say wait, nevermind this isn't a
> > page allocator page fallback to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug!
The unbind of a block-device from underneath the filesystem does
trigger the filesystem to emergency shutdown / go read-only. But unless
that process also includes a global zap of all dax mappings, not only
does it violate the expectation that page tables never point at
disappearing memory, the filesystem may also want to guarantee that no
further dax writes can happen after shutdown. Right now I believe it
only assumes that mmap I/O will come from page writeback, so there's no
need to bother applications with mappings to page cache, but dax
mappings need to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.