On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Feb 26, 2021 at 09:45:45AM +0000, ruansy.fnst@xxxxxxxxxxx wrote:
> > > > > > Hi, guys
> > > > > >
> > > > > > Besides this patchset, I'd like to confirm something about the
> > > > > > "EXPERIMENTAL" tag for dax in XFS.
> > > > > >
> > > > > > In XFS, the "EXPERIMENTAL" tag, which is reported in a warning
> > > > > > message when we mount a pmem device with the dax option, has
> > > > > > existed for a while. It's a bit annoying when using the fsdax
> > > > > > feature. So, my initial intention was to remove this tag, and I
> > > > > > started to find and solve the problems which prevent it from
> > > > > > being removed.
> > > > > >
> > > > > > As discussed before, there are 3 main problems. The first one is
> > > > > > "dax semantics", which has been resolved. The remaining two are
> > > > > > "RMAP for fsdax" and "support dax reflink for filesystem", which
> > > > > > I have been working on.
> > > > >
> > > > > <nod>
> > > > >
> > > > > > So, what I want to confirm is: does it mean that we can remove
> > > > > > the "EXPERIMENTAL" tag when the remaining two problems are solved?
> > > > >
> > > > > Yes. I'd keep the experimental tag for a cycle or two to make sure
> > > > > that nothing new pops up, but otherwise the two patchsets you've
> > > > > sent close those two big remaining gaps. Thank you for working on
> > > > > this!
> > > > >
> > > > > > Or maybe there are other important problems that need to be
> > > > > > fixed before removing it? If there are, could you please point
> > > > > > them out?
> > > > >
> > > > > That remains to be seen through QA/validation, but I think that's
> > > > > it.
> > > > >
> > > > > Granted, I still have to read through the two patchsets...
> > > >
> > > > I've been meaning to circle back here as well.
> > > >
> > > > My immediate concern is the issue Jason recently highlighted [1]
> > > > with respect to invalidating all dax mappings when / if the device
> > > > is ripped out from underneath the fs. I don't think that will
> > > > collide with Ruan's implementation, but it does need new
> > > > communication from driver to fs about removal events.
> > > >
> > > > [1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@xxxxxxxxxxxxxx
> > >
> > > Oh, yay.
> > >
> > > The XFS shutdown code is centred around preventing new IO from being
> > > issued - we don't actually do anything about DAX mappings because,
> > > well, I don't think anyone on the filesystem side thought they had
> > > to do anything special if pmem went away from under it.
> > >
> > > My understanding -was- that pmem removal invalidates all the ptes
> > > currently mapped into CPU page tables that point at the dax device
> > > across the system. The vmas that manage these mappings are not
> > > really something the filesystem manages, but a function of the mm
> > > subsystem. What the filesystem cares about is that it gets page
> > > faults triggered when a change of state occurs so that it can remap
> > > the page to its backing store correctly.
> > >
> > > IOWs, all the mm subsystem needs to do when pmem goes away is clear
> > > the CPU ptes, because then when userspace tries to access the mapped
> > > DAX pages we get a new page fault. In processing the fault, the
> > > filesystem will try to get direct access to the pmem from the block
> > > device. This will get an ENODEV error from the block device because
> > > the backing store (pmem) has been unplugged and is no longer
> > > there...
> > >
> > > AFAICT, as long as pmem removal invalidates all the active ptes that
> > > point at the pmem being removed, the filesystem doesn't need to care
> > > about device removal at all, DAX or no DAX...
> >
> > How would the pmem removal do that without walking all the active
> > inodes in the fs at the time of shutdown and calling
> > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
>
> Which then immediately ends up back at the vmas that manage the ptes
> to unmap them.
>
> Isn't finding the vma(s) that map a specific memory range exactly
> what the rmap code in the mm subsystem is supposed to address?

rmap can only look up vmas from a virt address relative to a given
mm_struct. The driver has neither the list of mm_struct objects nor
the virt addresses to look up. All it knows is that someone might have
mapped pages through the fsdax interface.

To me this looks like a notifier that fires from memunmap_pages()
after dev_pagemap_kill(), to notify any block_device associated with
that dev_pagemap that any dax mappings arranged through this
block_device are now invalid. The reason to do this after
dev_pagemap_kill() is so that any new mapping attempts racing with the
removal will be blocked.

The receiver of that notification needs to go from a block_device to a
superblock that has mapped inodes and walk ->s_inodes triggering the
unmap/invalidation.
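
[Editor's note: to make Dave's fault-path description concrete, here is
a minimal, hypothetical sketch of the check a dax fault handler would
hit once the device is gone. dax_direct_access() is the real kernel
API; pmem_gone_fault_check() is an invented helper, heavily simplified
from the actual dax_iomap_pte_fault() flow.]

#include <linux/dax.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

/*
 * Hypothetical, simplified fault-time probe: once the pmem is
 * unplugged, dax_direct_access() fails and the faulting process
 * gets SIGBUS instead of a mapping to vanished media.
 */
static vm_fault_t pmem_gone_fault_check(struct dax_device *dax_dev,
                                        pgoff_t pgoff)
{
        void *kaddr;
        pfn_t pfn;
        long nr;

        nr = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
        if (nr < 0)
                return VM_FAULT_SIGBUS; /* e.g. -ENODEV after unplug */

        /* the real fault path would install a pte for pfn here */
        return VM_FAULT_NOPAGE;
}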
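
[Editor's note: a minimal sketch of the notifier Dan proposes might
look like the following. pgmap_notifier_list, pgmap_register_notifier()
and the PGMAP_DEAD event are all hypothetical names - no such chain
exists in the kernel - while the notifier-chain helpers are real.]

#include <linux/notifier.h>
#include <linux/memremap.h>

/* hypothetical: a chain that fires when a dev_pagemap is torn down */
static BLOCKING_NOTIFIER_HEAD(pgmap_notifier_list);

#define PGMAP_DEAD      1       /* hypothetical event code */

int pgmap_register_notifier(struct notifier_block *nb)
{
        return blocking_notifier_chain_register(&pgmap_notifier_list, nb);
}

/*
 * Would be called from memunmap_pages(), after dev_pagemap_kill()
 * has already blocked racing attempts to set up new mappings.
 */
static void pgmap_notify_dead(struct dev_pagemap *pgmap)
{
        blocking_notifier_call_chain(&pgmap_notifier_list,
                                     PGMAP_DEAD, pgmap);
}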
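
[Editor's note: and a sketch of the receiver side. fs_dax_unmap_all()
is a hypothetical name, but the walk follows the kernel's existing
drop_pagecache_sb() pattern; resolving the notification's block_device
to a superblock (e.g. via get_super()) is elided.]

#include <linux/fs.h>
#include <linux/mm.h>

static void fs_dax_unmap_all(struct super_block *sb)
{
        struct inode *inode, *toput_inode = NULL;

        spin_lock(&sb->s_inode_list_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                spin_lock(&inode->i_lock);
                /* skip inodes that are being torn down or set up */
                if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                __iget(inode);
                spin_unlock(&inode->i_lock);
                spin_unlock(&sb->s_inode_list_lock);

                /* even_cows == 1: also zap private COW copies */
                unmap_mapping_range(inode->i_mapping, 0, 0, 1);

                /*
                 * Drop the previous inode only now: holding a reference
                 * to the current one keeps our place in s_inodes valid.
                 */
                iput(toput_inode);
                toput_inode = inode;
                spin_lock(&sb->s_inode_list_lock);
        }
        spin_unlock(&sb->s_inode_list_lock);
        iput(toput_inode);
}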