Re: [PATCH] dax: Allow block size > PAGE_SIZE

Asahi Lina <lina@xxxxxxxxxxxxx> · Fri, 8 Nov 2024 01:09:54 +0900

On 11/7/24 7:01 PM, Jan Kara wrote:
> On Wed 06-11-24 11:59:44, Dan Williams wrote:
>> Jan Kara wrote:
>> [..]
>>>> This WARN still feels like the wrong thing, though. Right now it is the
>>>> only thing in DAX code complaining on a page size/block size mismatch
>>>> (at least for virtiofs). If this is so important, I feel like there
>>>> should be a higher level check elsewhere, like something happening at
>>>> mount time or on file open. It should actually cause the operations to
>>>> fail cleanly.
>>>
>>> That's a fair point. Currently filesystems supporting DAX check for this in
>>> their mount code because there isn't really a DAX code that would get
>>> called during mount and would have enough information to perform the check.
>>> I'm not sure adding a new call just for this check makes a lot of sense.
>>> But if you have some good place in mind, please tell me.
>>
>> Is not the reason that dax_writeback_mapping_range() the only thing
>> checking ->i_blkbits because 'struct writeback_control' does writeback
>> in terms of page-index ranges?
> 
> To be fair, I don't remember why we've put the assertion specifically into
> dax_writeback_mapping_range(). But as Dave explained there's much more to
> this blocksize == pagesize limitation in DAX than just doing writeback in
> terms of page-index ranges. The whole DAX entry tracking in xarray would
> have to be modified to properly support other entry sizes than just PTE &
> PMD sizes because otherwise the entry locking just doesn't provide the
> guarantees that are expected from filesystems (e.g. you could have parallel
> modifications happening to a single fs block in pagesize < blocksize case).
> 
>> All other dax entry points are filesystem controlled that know the
>> block-to-pfn-to-mapping relationship.
>>
>> Recall that dax_writeback_mapping_range() is historically for pmem
>> persistence guarantees to make sure that applications write through CPU
>> cache to media.
> 
> Correct.
> 
>> Presumably there are no cache coherency concerns with fuse and dax
>> writes from the guest side are not a risk of being stranded in CPU
>> cache. Host side filesystem writeback will take care of them when / if
>> the guest triggers a storage device cache flush, not a guest page cache
>> writeback.
> 
> I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it
> should provide persistency guarantees on the file contents even in case of
> *host* power failure. So if the guest is directly mapping host's page cache
> pages through virtiofs, filemap_fdatawrite() call in the guest must result
> in fsync(2) on the host to persist those pages. And as far as I vaguely
> remember that happens by KVM catching the arch_wb_cache_pmem() calls and
> issuing fsync(2) on the host. But I could be totally wrong here.

I don't think that's how it actually works, at least on arm64.
arch_wb_cache_pmem() calls dcache_clean_pop() which is either dc cvap or
dc cvac. Those are trapped by HCR_EL2<TPC>, and that is never set by KVM.

There was some discussion of this here:
https://lore.kernel.org/all/20190702055937.3ffpwph7anvohmxu@US-160370MP2.local/

But I'm not sure that all really made sense then.

msync() and fsync() should already provide persistence. Those end up
calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs
(or fdatasyncs) the whole file. What I'm not so sure is whether there
are any other codepaths that also need to provide those guarantees which
*don't* end up calling fsync on the VFS. For example, the manpages kind
of imply munmap() syncs, though as far as I can tell that's not actually
the case. If there are missing sync paths, then I think those might just
be broken right now...

~~ Lina