Re: [PATCH] dax: Allow block size > PAGE_SIZE

Jan Kara <jack@xxxxxxx> · Fri, 8 Nov 2024 13:16:41 +0100

On Fri 08-11-24 01:09:54, Asahi Lina wrote:
> On 11/7/24 7:01 PM, Jan Kara wrote:
> > On Wed 06-11-24 11:59:44, Dan Williams wrote:
> >> Jan Kara wrote:
> >> [..]
> >>>> This WARN still feels like the wrong thing, though. Right now it is the
> >>>> only thing in DAX code complaining on a page size/block size mismatch
> >>>> (at least for virtiofs). If this is so important, I feel like there
> >>>> should be a higher level check elsewhere, like something happening at
> >>>> mount time or on file open. It should actually cause the operations to
> >>>> fail cleanly.
> >>>
> >>> That's a fair point. Currently filesystems supporting DAX check for this in
> >>> their mount code because there isn't really a DAX code that would get
> >>> called during mount and would have enough information to perform the check.
> >>> I'm not sure adding a new call just for this check makes a lot of sense.
> >>> But if you have some good place in mind, please tell me.
> >>
> >> Is not the reason that dax_writeback_mapping_range() the only thing
> >> checking ->i_blkbits because 'struct writeback_control' does writeback
> >> in terms of page-index ranges?
> > 
> > To be fair, I don't remember why we've put the assertion specifically into
> > dax_writeback_mapping_range(). But as Dave explained there's much more to
> > this blocksize == pagesize limitation in DAX than just doing writeback in
> > terms of page-index ranges. The whole DAX entry tracking in xarray would
> > have to be modified to properly support other entry sizes than just PTE &
> > PMD sizes because otherwise the entry locking just doesn't provide the
> > guarantees that are expected from filesystems (e.g. you could have parallel
> > modifications happening to a single fs block in pagesize < blocksize case).
> > 
> >> All other dax entry points are filesystem controlled that know the
> >> block-to-pfn-to-mapping relationship.
> >>
> >> Recall that dax_writeback_mapping_range() is historically for pmem
> >> persistence guarantees to make sure that applications write through CPU
> >> cache to media.
> > 
> > Correct.
> > 
> >> Presumably there are no cache coherency concerns with fuse and dax
> >> writes from the guest side are not a risk of being stranded in CPU
> >> cache. Host side filesystem writeback will take care of them when / if
> >> the guest triggers a storage device cache flush, not a guest page cache
> >> writeback.
> > 
> > I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
> > should provide persistency guarantees on the file contents even in case of
> > *host* power failure. So if the guest is directly mapping host's page cache
> > pages through virtiofs, filemap_fdatawrite() call in the guest must result
> > in fsync(2) on the host to persist those pages. And as far as I vaguely
> > remember that happens by KVM catching the arch_wb_cache_pmem() calls and
> > issuing fsync(2) on the host. But I could be totally wrong here.
> 
> I don't think that's how it actually works, at least on arm64.
> arch_wb_cache_pmem() calls dcache_clean_pop() which is either dc cvap or
> dc cvac. Those are trapped by HCR_EL2<TPC>, and that is never set by KVM.
> 
> There was some discussion of this here:
> https://lore.kernel.org/all/20190702055937.3ffpwph7anvohmxu@US-160370MP2.local/

I see. Thanks for correcting me.

> But I'm not sure that all really made sense then.
> 
> msync() and fsync() should already provide persistence. Those end up
> calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs
> (or fdatasyncs) the whole file. What I'm not so sure is whether there
> are any other codepaths that also need to provide those guarantees which
> *don't* end up calling fsync on the VFS. For example, the manpages kind
> of imply munmap() syncs, though as far as I can tell that's not actually
> the case. If there are missing sync paths, then I think those might just
> be broken right now...

munmap(2) is not an issue because that has no persistency guarantees in
case of power failure attached to it. Thinking about it some more I agree
that just dropping dax_writeback_mapping_range() from virtiofs should be
safe. The modifications are going to be persisted by the host eventually
(so writeback as such isn't needed) and all crash-safe guarantees are
revolving around calls like fsync(2), sync(2), sync_fs(2) which get passed
by fuse and hopefully acted upon on the host. I'm quite confident with this
because even standard filesystems such as ext4 flush disk caches only in
response to operations like these (plus some in journalling code but that's
a separate story).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR