Re: [PATCH 00/13] Fix the DAX-gup mistake

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jason Gunthorpe wrote:
> On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote:
> > tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to
> > map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages()
> > for fsdax and devdax inodes, and use page_maybe_dma_pinned() to
> > determine when filesystems can safely truncate DAX mappings vs DMA.
> > 
> > The longer story is that DAX has caused friction with folio development
> > and other device-memory use cases due to its hack of using a
> > page-reference count of 1 to indicate that the page is DMA idle. That
> > situation arose from the mistake of not managing DAX page reference
> > counts at map time. The lack of page reference counting at map time grew
> > organically from the original DAX experiment of attempting to manage DAX
> > mappings without page structures. The page lock, dirty tracking and
> > other entry management was supported sans pages. However, the page
> > support was then bolted on incrementally so solve problems with gup,
> > memory-failure, and all the other kernel services that are missing when
> > a pfn does not have an associated page structure.
> > 
> > Since then John has led an effort to account for when a page is pinned
> > for DMA vs other sources that elevate the reference count. The
> > page_maybe_dma_pinned() helper slots in seamlessly to replace the need
> > to track transitions to page->_refount == 1.
> > 
> > The larger change in this set comes from Jason's observation that
> > inserting DAX mappings without any reference taken is a bug. So
> > dax_insert_entry(), that fsdax uses, is updated to take 'struct
> > dev_pagemap' references, and devdax is updated to reuse the same.
> 
> It wasn't pagemap references that were the problem, it was struct page
> references.
> 
> pagemap is just something that should be ref'd in the background, as
> long as a struct page has a positive reference the pagemap should be
> considered referenced, IMHO free_zone_device_page() should be dealing
> with this - put the pagemap after calling page_free().

Yes.

I think I got caught admiring the solution of the
page_maybe_dma_pinned() conversion for replacing the ugly observation of
the 2 -> 1 refcount transition, and then introduced this breakage. I
will rework this to catch the 0 to 1 transition of the refcount for
incrementing and use free_zone_device_page() to drop the pgmap
reference.

> Pagemap is protecting page->pgmap from UAF so we must ensure we hold
> it when we do pgmap->ops
> 
> That should be the only put, and it should pair with the only get
> which happens when the driver takes a 0 refcount page out of its free
> list and makes it have a refcount of 1.
> 
> > page mapping helpers. One of the immediate hurdles is the usage of
> > pmd_devmap() to distinguish large page mappings that are not transparent
> > huge pages.
> 
> And this is because the struct page refcounting is not right :|
> 
> I had thought the progression would be to make fsdax use compound
> folios, install compound folios in the PMD, remove all the special
> case refcounting for DAX from the pagetable code, then address the
> pgmap issue from the basis of working page->refcount, eg by putting a
> pgmap put in right after the op->page_free call.

As far as I can see as long as the pgmap is managed at map and
free_zone_device_page() time then the large folio conversion can come
later.

> Can we continue to have the weird page->refcount behavior and still
> change the other things?

No at a minimum the pgmap vs page->refcount problem needs to be solved
first.




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux