Re: [PATCH 00/13] Fix the DAX-gup mistake

Dan Williams <dan.j.williams@xxxxxxxxx> · Wed, 7 Sep 2022 11:43:52 -0700

Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote:
> > > Dan Williams wrote:
> > > > Jason Gunthorpe wrote:
> > > > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> > > > > > Jason Gunthorpe wrote:
> > > > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > > > > > > 
> > > > > > > > > Can we continue to have the weird page->refcount behavior and still
> > > > > > > > > change the other things?
> > > > > > > > 
> > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > > > > > > first.
> > > > > > > 
> > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In
> > > > > > > the normal case the tlb flusher does it integrated into zap..
> > > > > > 
> > > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking
> > > > > > about page_remove_rmap() or some other reference count drop?
> > > > > 
> > > > > No, page refcount.
> > > > > 
> > > > > __tlb_remove_page() eventually causes a put_page() via
> > > > > tlb_batch_pages_flush() calling free_pages_and_swap_cache()
> > > > > 
> > > > > Eg:
> > > > > 
> > > > >  *  MMU_GATHER_NO_GATHER
> > > > >  *
> > > > >  *  If the option is set the mmu_gather will not track individual pages for
> > > > >  *  delayed page free anymore. A platform that enables the option needs to
> > > > >  *  provide its own implementation of the __tlb_remove_page_size() function to
> > > > >  *  free pages.
> > > > 
> > > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer
> > > > since it is incremental to the _refcount handling fix and maintain that
> > > > DAX pages are still !vm_normal_page() in this set.
> > > > 
> > > > > > > Can we safely have the put page in the fsdax side after the zap?
> > > > > > 
> > > > > > The _refcount is managed from the lifetime insert_page() to
> > > > > > truncate_inode_pages(), where for DAX those are managed from
> > > > > > dax_insert_dentry() to dax_delete_mapping_entry().
> > > > > 
> > > > > As long as we all understand the page doesn't become re-allocatable
> > > > > until the refcount reaches 0 and the free op is called it may be OK!
> > > > 
> > > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for
> > > > when the filesystem can safely reuse the page, it really needs to wait
> > > > for the reference count to drop to 0 similar to how it waits for the
> > > > page-idle condition today.
> > > 
> > > This gets tricky with break_layouts(). For example xfs_break_layouts()
> > > wants to ensure that the page is gup idle while holding the mmap lock.
> > > If the page is not gup idle it needs to drop locks and retry. It is
> > > possible the path to drop a page reference also needs to acquire
> > > filesystem locs. Consider odd cases like DMA from one offset to another
> > > in the same file. So waiting with filesystem locks held is off the
> > > table, which also means that deferring the wait until
> > > dax_delete_mapping_entry() time is also off the table.
> > > 
> > > That means that even after the conversion to make DAX page references
> > > 0-based it will still be the case that filesystem code will be waiting
> > > for the 2 -> 1 transition to indicate "mapped DAX page has no more
> > > external references".
> > 
> > Why?
> > 
> > If you are doing the break layouts wouldn't you first zap the ptes,
> > which will bring the reference to 0 if there are not other users.
> 
> The internals of break layouts does zap the ptes, but it does not remove
> the mapping entries. So, I was limiting my thinking to that constraint,
> but now that I push on it, the need to keep the entry around until the
> final truncate_setsize() event seems soft. It should be ok to upgrade
> break layouts to delete mapping entries, wait for _refcount to drop to
> zero, and then re-evaluate that nothing installed a new entry after
> acquiring the filesystem locks again.

It is still the case that while waiting for the page to go idle it is
associated with its given file / inode. It is possible that
memory-failure, or some other event that requires looking up the page's
association, fires in that time span.

If that happens the page's association to the file needs to be kept in
tact. So it is still the case that while waiting for the final put the
page count needs to remain elevated to maintain the page's association
to the file until break layouts can be sure it is doing the final put
under filesystem locks. I.e. break layouts is "make it safe to do the
truncate", not "do the truncate up front".

The truncate will still move from being done while the _refcount is 1 to
being done while the _refcount is 0, but the condition for break layouts
to signal it is safe to proceed is when it can observe _refcount == 0,
or the 1 -> 0 transition.