Re: [PATCH] dax: Allow block size > PAGE_SIZE

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 7 Nov 2024 13:22:11 -0800

Jan Kara wrote:
> On Wed 06-11-24 11:59:44, Dan Williams wrote:
> > Jan Kara wrote:
> > [..]
> > > > This WARN still feels like the wrong thing, though. Right now it is the
> > > > only thing in DAX code complaining on a page size/block size mismatch
> > > > (at least for virtiofs). If this is so important, I feel like there
> > > > should be a higher level check elsewhere, like something happening at
> > > > mount time or on file open. It should actually cause the operations to
> > > > fail cleanly.
> > > 
> > > That's a fair point. Currently filesystems supporting DAX check for this in
> > > their mount code because there isn't really a DAX code that would get
> > > called during mount and would have enough information to perform the check.
> > > I'm not sure adding a new call just for this check makes a lot of sense.
> > > But if you have some good place in mind, please tell me.
> > 
> > Is not the reason that dax_writeback_mapping_range() the only thing
> > checking ->i_blkbits because 'struct writeback_control' does writeback
> > in terms of page-index ranges?
> 
> To be fair, I don't remember why we've put the assertion specifically into
> dax_writeback_mapping_range(). But as Dave explained there's much more to
> this blocksize == pagesize limitation in DAX than just doing writeback in
> terms of page-index ranges. The whole DAX entry tracking in xarray would
> have to be modified to properly support other entry sizes than just PTE &
> PMD sizes because otherwise the entry locking just doesn't provide the
> guarantees that are expected from filesystems (e.g. you could have parallel
> modifications happening to a single fs block in pagesize < blocksize case).

Oh, yes, agree with that, was just observing that if "i_blkbits !=
PAGE_SHIFT" then at a mininum the range_start and range_end values from
writeback_control would need to be checked for alignment to the block
boundary.

> > All other dax entry points are filesystem controlled that know the
> > block-to-pfn-to-mapping relationship.
> > 
> > Recall that dax_writeback_mapping_range() is historically for pmem
> > persistence guarantees to make sure that applications write through CPU
> > cache to media.
> 
> Correct.
> 
> > Presumably there are no cache coherency concerns with fuse and dax
> > writes from the guest side are not a risk of being stranded in CPU
> > cache. Host side filesystem writeback will take care of them when / if
> > the guest triggers a storage device cache flush, not a guest page cache
> > writeback.
> 
> I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
> should provide persistency guarantees on the file contents even in case of
> *host* power failure.

It should, yes, but not necessarily through
dax_writeback_mapping_range().

> So if the guest is directly mapping host's page cache pages through
> virtiofs, filemap_fdatawrite() call in the guest must result in
> fsync(2) on the host to persist those pages. And as far as I vaguely
> remember that happens by KVM catching the arch_wb_cache_pmem() calls
> and issuing fsync(2) on the host. But I could be totally wrong here.

While I imagine you could invent some scheme to trap
dax_flush()/arch_wb_cache_pmem() as the signal to trigger page writeback
on the host, my expectation is that should be handled by the
REQ_{PREFLUSH,FUA} to the backing device that follows a page-cache
writeback event. This is the approach taken by virtio_pmem.

Now, if virtio_fs does not have a block_device to receive those requests
then I can see why trapping arch_wb_cache_pmem() is attempted, but a
backing device signal to flush the host conceptually makes more sense to
me because dax, on the guest side, explicitly means there are no
software buffers to write-back. The host just needs a coarse signal that
if it is buffering any pages on behalf of the guest, it now needs to
flush them.