Re: ZONE_DEVICE refcounting

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 22 Mar 2024 14:18:13 +1100

On Fri, Mar 22, 2024 at 11:01:25AM +1100, Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
> 
> > Alistair Popple wrote:
> >> 
> >> Alistair Popple <apopple@xxxxxxxxxx> writes:
> >> 
> >> > Dan Williams <dan.j.williams@xxxxxxxxx> writes:
> >> >
> >> >> Alistair Popple wrote:
> >> >
> >> > I also noticed folio_anon() is not safe to call on a FS DAX page due to
> >> > sharing PAGE_MAPPING_DAX_SHARED.
> >> 
> >> Also it feels like I could be missing something here. AFAICT the
> >> page->mapping and page->index fields can't actually be used outside of
> >> fs/dax because they are overloaded for the shared case. Therefore
> >> setting/clearing them could be skipped and the only reason for doing so
> >> is so dax_associate_entry()/dax_disassociate_entry() can generate
> >> warnings which should never occur anyway. So all that code is
> >> functionally unnecessary.
> >
> > What do you mean outside of fs/dax, do you literally mean outside of
> > fs/dax.c, or the devdax case (i.e. dax without fs-entanglements)?
> 
> Only the cases fs dax pages might need it. ie. Not devdax which I
> haven't looked at closely yet.
> 
> > Memory
> > failure needs ->mapping and ->index to rmap dax pages. See
> > mm/memory-failure.c::__add_to_kill() and
> > mm/memory-failure.c::__add_to_kill_fsdax() where that latter one is for
> > cases where the fs needs has signed up to react to dax page failure.
> 
> How does that work for reflink/shared pages which overwrite
> page->mapping and page->index?

Via reverse mapping in the *filesystem*, not the mm rmap stuff.

pmem_pagemap_memory_failure()
  dax_holder_notify_failure()
    .notify_failure()
      xfs_dax_notify_failure()
        xfs_dax_notify_ddev_failure()
	  xfs_rmap_query_range(xfs_dax_failure_fn)
	     xfs_dax_failure_fn(rmap record)
	       <grabs inode from cache>
	       <converts range to file offset>
	       mf_dax_kill_procs(inode->mapping, pgoff)
	         collect_procs_fsdax(mapping, page)
		   add_to_kill_fsdax(task)
		     __add_to_kill(task)
		 unmap_and_kill_tasks()

Remember: in FSDAX, the pages are the storage media physically owned
by the filesystem, not the mm subsystem. Hence answering questions
like "who owns this page" can only be answered correctly by asking
the filesystem.

We shortcut that for pages that only have one owner - we just store
the owner information in the page as a {mapping, offset} tuple. But
when we have multiple owners, the only way to find all the {mapping,
offset} tuples is to ask the filesystem to find all the owners of
that page.

Hence the special case values for page->mapping/page->index for
pages over shared filesystem extents. These shared extents are

communicated to the fsdax layer via the IOMAP_F_SHARED flag
in the iomaps returned by the filesystem. This flag is the trigger
for the special mapping share count behaviour to be used. e.g. see
dax_insert_entry(iomap_iter) -> dax_associate_entry(shared) ->
dax_page_share_get()....

> Eg. in __add_to_kill() if *p is a shared fs
> dax page then p->mapping == PAGE_MAPPING_DAX_SHARED and
> page_address_in_vma(vma, p) will probably crash.

As per above, we don't get the mapping from the page in those cases.

If you haven't got access to the page though a filesystem method and
guaranteed that truncate() can't free it from under you, then you're
probably doing the wrong thing with fsdax...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx