Re: Question about Experimental of Filesystem DAX.

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 31 May 2018 18:03:45 -0700

On Thu, May 31, 2018 at 4:05 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
>> On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
>> <darrick.wong@xxxxxxxxxx> wrote:
>> > On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
>> >> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
>> >> <ross.zwisler@xxxxxxxxxxxxxxx> wrote:
>> >> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
>> >> >> Hello,
>> >> >>
>> >> >>
>> >> >> I would like to know about the Experimental message of Filesystem DAX.
>> >> >> --------------------------------------------------------
>> >> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
>> >> >> --------------------------------------------------------
>> >> >>
>> >> >> AFAIK, the final issue of Filesystem DAX is metadata update problem,
>> >> >> and it is(will be?) solved by great effort of MAP_SYNC and
>> >> >> "fix dma vs truncate/hole-punch" patch set.
>> >> >> So, I suppose that the Experimental message can be removed,
>> >> >> but I'm not sure.
>> >> >>
>> >> >> Is it possible?
>> >> >> Otherwise, are there any other issues in Filesystem DAX yet?
>> >> >>
>> >> >> If this is silly question, sorry for noise....
>> >> >>
>> >> >> Thanks,
>> >> >> ---
>> >> >> Yasunori Goto
>> >> >
>> >> > Adding in the XFS and ext4 developers, as it's really their call when to
>> >> > remove this notice.
>> >> >
>> >> > We've talked about this off and on for a long while, but IMHO we should remove
>> >> > the EXPERIMENTAL warning.  The last few things that we had on our TODO list
>> >> > before this was removed were:
>> >> >
>> >> > 1) Get consistent handling of the DAX mount option.  We currently have this,
>> >> > as both filesystems will behave the same and fall back and remove the DAX
>> >> > mount option if it is unsupported by the block device, etc.
>> >
>> > <nod>
>> >
>> > As an aside, I wonder if Christoph's musings about "just have the kernel
>> > determine the appropriate dax/non-dax setting from the acpi tables and
>> > skip the inode flag entirely" ever got resolved?
>> >
>> >> > 2) Get consistent handling of the DAX inode option.  We currently have this,
>> >> > as all DAX behavior now happens through the mount option.  If/when we
>> >> > re-enable the per-inode DAX flag we should do it consistently for all DAX
>> >> > enabled filesystems.
>> >
>> > The behavior of the inode flag isn't all that consistent.  ext4 doesn't
>> > support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
>> > directory which will propagate the setting to any files created in that
>> > directory.
>> >
>> > However, if you set or clear it on a file we update the on-disk inode
>> > but we can't change the in-core state flag (S_DAX) until the next
>> > in-core inode instantiation.  It's weird that users can change the flag
>> > but the intended behavior changes won't happen until some ... time ...
>> > in the future??
>> >
>> >> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
>> >> > done, but we at least disallow DAX with XFS features like reflink where it
>> >> > could be an issue.  Darrick, do you still feel like we need to get these
>> >> > working together to remove EXPERIMENTAL, or are you happy enough that we're
>> >> > keeping them separated and that we're keeping user data safe?
>> >
>> > Yes, reflink and dax still need to work together.  I've not heard any
>> > good arguments for why page sharing + copy on write are fundamentally
>> > incompatible with the dax model, or why dax users will never, ever
>> > require reflink.
>>
>> Right, but that's separate from DAX being scream in your face
>> "EXPERIMENTAL!". It's just an additional feature that can be added on
>> once all the normal expectations of a userspace mapping work. I think
>> reliable rmap is the last of those requirements.
>>
>> > The recent thread between Jan and Dan make me wonder if making mappings
>> > share struct pages is going to be a nightmare to add to the mm code,
>> > though...
>>
>> It's going to be a bit messy because a singular page->mapping
>> association is fundamentally incompatible with DAX. Perhaps a linked
>> list of mapping "siblings"?
>
> I'd much prefer the filesystem allocate/control the struct page that
> is inserted into mapping trees so we can have multiple struct pages
> pointing at the one physical page.  That way we can just insert
> these dynamic struct pages into the relevant mappings and it works
> the same way for both DAX and shared page cache pages.

How would that work when there is a 1:1 pfn-to-page and
file-block-to-pfn relationship?

> i.e. the filesystem knows they are shared physical blocks, the
> filesystem controls COW of physical blocks, the filesystem controls
> truncate/invalidation of physical blocks, the filesystem controls
> cache state of the physical blocks. So why are we designing
> infrastructure around the virtual memory and caching infrastructure
> that bypasses the layer that manages and arbitrates access to the
> physical storage?

Yes, because DAX broke the vm's assumptions that pages are not
physical storage blocks.

> This seems like we're well down the path of a architectural layering
> violation that is backing us into a corner we're not going to be
> able to get ourselves out of...

I think it is solvable by teaching the vm more about dax pages and
having it call back into the filesytem for some of these operations.

>> > Also: ideally XFS would also be able to consume poison event
>> > notifications from the pmem so that it can try to deal with metadata
>> > loss, but that's probably a separate effort.
>
> If the design is such that the layer that manages the physical
> storage isn't going to be told about physical storage failures
> before anyone else is informed, it would seem to me like we really
> have introduced a major architectural flaw in DAX....

It would be trivial to hook these notifications into the filesystem.
This is something we've had on the backlog for a long time to circle
back and address. This has been waiting for the xfs reverse-map work
to settle and for one of us pmem developers to free up and do the
work.