Re: [PATCH v3 00/25] fs/dax: Fix ZONE_DEVICE page reference counts

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Dec 14, 2024 at 04:22:58PM +0100, David Hildenbrand wrote:
> On 14.12.24 02:39, Dan Williams wrote:
> > [ add akpm and sfr for next steps ]
> > 
> > Alistair Popple wrote:
> > > Main updates since v2:
> > > 
> > >   - Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
> > >     and have them pass the vmf struct.
> > > 
> > >   - Seperate out the device DAX changes.
> > > 
> > >   - Restore the page share mapping counting and associated warnings.
> > > 
> > >   - Rework truncate to require file-systems to have previously called
> > >     dax_break_layout() to remove the address space mapping for a
> > >     page. This found several bugs which are fixed by the first half of
> > >     the series. The motivation for this was initially to allow the FS
> > >     DAX page-cache mappings to hold a reference on the page.
> > > 
> > >     However that turned out to be a dead-end (see the comments on patch
> > >     21), but it found several bugs and I think overall it is an
> > >     improvement so I have left it here.
> > > 
> > > Device and FS DAX pages have always maintained their own page
> > > reference counts without following the normal rules for page reference
> > > counting. In particular pages are considered free when the refcount
> > > hits one rather than zero and refcounts are not added when mapping the
> > > page.
> > > 
> > > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > > mechanism for allowing GUP to hold references on the page (see
> > > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > > DAX pages need their own reference counting scheme.
> > > 
> > > By treating the refcounts on these pages the same way as normal pages
> > > we can remove a lot of special checks. In particular pXd_trans_huge()
> > > becomes the same as pXd_leaf(), although I haven't made that change
> > > here. It also frees up a valuable SW define PTE bit on architectures
> > > that have devmap PTE bits defined.
> > > 
> > > It also almost certainly allows further clean-up of the devmap managed
> > > functions, but I have left that as a future improvment. It also
> > > enables support for compound ZONE_DEVICE pages which is one of my
> > > primary motivators for doing this work.
> > 
> > So this is feeling ready for -next exposure, and ideally merged for v6.14. I
> > see the comments from John and Bjorn and that you were going to respin for
> > that, but if it's just those details things they can probably be handled
> > incrementally.
> > 
> > Alistair, are you ready for this to hit -next?

Yeah, I'm pretty happy with the series now. It "feels" right.

There's a couple of dumb build bot errors, so I was going to respin to fix
those as well. I got caught up with a few other things so was just letting this
sit awaiting feedback, but I should be able to post a respin early this week.

> > As for which tree...
> > 
> > Andrew, we could take this through -mm, but my first instinct would be to try
> > to take it through nvdimm.git mainly to offload any conflict wrangling work and
> > small fixups which are likely to be an ongoing trickle.
> > 
> > However, I am not going to put up much of a fight if others prefer this go
> > through -mm.
> > 
> > Thoughts?
> 
> I'm in the process of preparing v2 of [1] that will result in conflicts with
> this series in the rmap code (in particular [PATCH v3 14/25] huge_memory:
> Allow mappings of PUD sized pages).
> 
> I'll be away for 2 weeks over Christmas, but I assume I'll manage to post v2
> shortly.
> 
> Which reminds me that I still have to take a closer look at some things in
> this series :) Especially also #14 regarding accounting.
> 
> I wonder if we could split out the rmap changes in #14, and have that patch
> simply in two trees? No idea.

I could split out the first half (patches 1 - 8) into a series to go via
nvdimm.git, because they are actually standalone clean ups that I think are
worthwhile anyway.

The remainder are more -mm focussed. However they do depend on the fs/dax
cleanups in the first half so the trick would be making sure Andrew only takes
them if the nvdimm.git changes have made it into -next. I'm happy with either
approach, so let me know if I should split the series or not.

 - Alistair

> [1]
> https://lore.kernel.org/all/20240829165627.2256514-1-david@xxxxxxxxxx/T/#u
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 




[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux