On Tue 12-11-24 18:49:46, Asahi Lina wrote: > On 11/8/24 9:16 PM, Jan Kara wrote: > > On Fri 08-11-24 01:09:54, Asahi Lina wrote: > >> On 11/7/24 7:01 PM, Jan Kara wrote: > >>> On Wed 06-11-24 11:59:44, Dan Williams wrote: > >>>> Jan Kara wrote: > >>>> [..] > >>>>>> This WARN still feels like the wrong thing, though. Right now it is the > >>>>>> only thing in DAX code complaining on a page size/block size mismatch > >>>>>> (at least for virtiofs). If this is so important, I feel like there > >>>>>> should be a higher level check elsewhere, like something happening at > >>>>>> mount time or on file open. It should actually cause the operations to > >>>>>> fail cleanly. > >>>>> > >>>>> That's a fair point. Currently filesystems supporting DAX check for this in > >>>>> their mount code because there isn't really a DAX code that would get > >>>>> called during mount and would have enough information to perform the check. > >>>>> I'm not sure adding a new call just for this check makes a lot of sense. > >>>>> But if you have some good place in mind, please tell me. > >>>> > >>>> Is not the reason that dax_writeback_mapping_range() the only thing > >>>> checking ->i_blkbits because 'struct writeback_control' does writeback > >>>> in terms of page-index ranges? > >>> > >>> To be fair, I don't remember why we've put the assertion specifically into > >>> dax_writeback_mapping_range(). But as Dave explained there's much more to > >>> this blocksize == pagesize limitation in DAX than just doing writeback in > >>> terms of page-index ranges. The whole DAX entry tracking in xarray would > >>> have to be modified to properly support other entry sizes than just PTE & > >>> PMD sizes because otherwise the entry locking just doesn't provide the > >>> guarantees that are expected from filesystems (e.g. you could have parallel > >>> modifications happening to a single fs block in pagesize < blocksize case). > >>> > >>>> All other dax entry points are filesystem controlled that know the > >>>> block-to-pfn-to-mapping relationship. > >>>> > >>>> Recall that dax_writeback_mapping_range() is historically for pmem > >>>> persistence guarantees to make sure that applications write through CPU > >>>> cache to media. > >>> > >>> Correct. > >>> > >>>> Presumably there are no cache coherency concerns with fuse and dax > >>>> writes from the guest side are not a risk of being stranded in CPU > >>>> cache. Host side filesystem writeback will take care of them when / if > >>>> the guest triggers a storage device cache flush, not a guest page cache > >>>> writeback. > >>> > >>> I'm not so sure. When you call fsync(2) in the guest on virtiofs file, it > >>> should provide persistency guarantees on the file contents even in case of > >>> *host* power failure. So if the guest is directly mapping host's page cache > >>> pages through virtiofs, filemap_fdatawrite() call in the guest must result > >>> in fsync(2) on the host to persist those pages. And as far as I vaguely > >>> remember that happens by KVM catching the arch_wb_cache_pmem() calls and > >>> issuing fsync(2) on the host. But I could be totally wrong here. > >> > >> I don't think that's how it actually works, at least on arm64. > >> arch_wb_cache_pmem() calls dcache_clean_pop() which is either dc cvap or > >> dc cvac. Those are trapped by HCR_EL2<TPC>, and that is never set by KVM. > >> > >> There was some discussion of this here: > >> https://lore.kernel.org/all/20190702055937.3ffpwph7anvohmxu@US-160370MP2.local/ > > > > I see. Thanks for correcting me. > > > >> But I'm not sure that all really made sense then. > >> > >> msync() and fsync() should already provide persistence. Those end up > >> calling vfs_fsync_range(), which becomes a FUSE fsync(), which fsyncs > >> (or fdatasyncs) the whole file. What I'm not so sure is whether there > >> are any other codepaths that also need to provide those guarantees which > >> *don't* end up calling fsync on the VFS. For example, the manpages kind > >> of imply munmap() syncs, though as far as I can tell that's not actually > >> the case. If there are missing sync paths, then I think those might just > >> be broken right now... > > > > munmap(2) is not an issue because that has no persistency guarantees in > > case of power failure attached to it. Thinking about it some more I agree > > that just dropping dax_writeback_mapping_range() from virtiofs should be > > safe. The modifications are going to be persisted by the host eventually > > (so writeback as such isn't needed) and all crash-safe guarantees are > > revolving around calls like fsync(2), sync(2), sync_fs(2) which get passed > > by fuse and hopefully acted upon on the host. I'm quite confident with this > > because even standard filesystems such as ext4 flush disk caches only in > > response to operations like these (plus some in journalling code but that's > > a separate story). > > > > Honza > > I think we should go with that then. Should I send it as Suggested-by: > Dan or do you want to send it? I say go ahead and send it with Dan's suggested-by :) Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR