On Tue, Dec 14, 2021 at 03:43:38PM -0800, Dan Williams wrote:
> On Tue, Dec 14, 2021 at 12:33 PM Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> >
> > On Tue, Dec 14, 2021 at 08:41:30AM -0800, Dan Williams wrote:
> > > On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > > > > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > > > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > > > > > > Going forward, I am wondering should virtiofs use flushcache version as
> > > > > > > well. What if host filesystem is using DAX and mapping persistent memory
> > > > > > > pfn directly into qemu address space. I have never tested that.
> > > > > > >
> > > > > > > Right now we are relying on applications to do fsync/msync on virtiofs
> > > > > > > for data persistence.
> > > > > >
> > > > > > This sounds like it would need coordination with a paravirtualized
> > > > > > driver that can indicate whether the host side is pmem or not, like
> > > > > > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > > > > > you would still need to go explicitly cache flush any dirty page
> > > > > > because you can't necessarily trust that the guest did that already.
> > > > >
> > > > > Do we? The application can't really know what backend it is on, so
> > > > > it sounds like the current virtiofs implementation doesn't really, does it?
> > > >
> > > > Agreed that application does not know what backend it is on. So virtiofs
> > > > just offers regular posix API where applications have to do fsync/msync
> > > > for data persistence. No support for mmap(MAP_SYNC). We don't offer persistent
> > > > memory programming model on virtiofs. That's not the expectation. DAX
> > > > is used only to bypass guest page cache.
> > > >
> > > > With this assumption, I think we might not have to use flushcache version
> > > > at all even if shared filesystem is on persistent memory on host.
> > > >
> > > > - We mmap() host files into qemu address space. So any dax store in virtiofs
> > > >   should make corresponding pages dirty in page cache on host and when
> > > >   and fsync()/msync() comes later, it should flush all the data to PMEM.
> > > >
> > > > - In case of file extending writes, virtiofs falls back to regular
> > > >   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
> > > >   should make sure writes are flushed to pmem immediately.
> > > >
> > > > Are there any other path I am missing. If not, looks like we might not
> > > > have to use flushcache version in virtiofs at all as long as we are not
> > > > offering guest applications user space flushes and MAP_SYNC support.
> > > >
> > > > We still might have to use machine check safe variant though as loads
> > > > might generate synchronous machine check. What's not clear to me is
> > > > that if this MC safe variant should be used only in case of PMEM or
> > > > should it be used in case of non-PMEM as well.
> > >
> > > It should be used on any memory address that can throw exception on
> > > load, which is any physical address, in paths that can tolerate
> > > memcpy() returning an error code, most I/O paths, and can tolerate
> > > slower copy performance on older platforms that do not support MC
> > > recovery with fast string operations, to date that's only PMEM users.
> >
> > Ok, So basically latest cpus can do fast string operations with MC
> > recovery so that using MC safe variant is not a problem.
> >
> > Then there is range of cpus which can do MC recovery but do slower
> > versions of memcpy and that's where the issue is.
> >
> > So if we knew that virtiofs dax window is backed by a pmem device
> > then we should always use MC safe variant.
> > Even if it means paying
> > the price of slow version for the sake of correctness.
> >
> > But if we are not using pmem on host, then there is no point in
> > using MC safe variant.
> >
> > IOW.
> >
> > if (virtiofs_backed_by_pmem) {
>
> No, PMEM should not be considered at all relative to whether to use MC
> or not, it is 100% a decision of whether you expect virtiofs users
> will balk more at unhandled machine checks or performance regressions
> on the platforms that set "enable_copy_mc_fragile()".

If we don't handle the machine check, the kernel will panic(), right? So
that's the trade-off: either get higher performance (on select platforms)
and crash if an MC happens, OR get slower memcpy() performance (on select
platforms) and recover from the MC. Hmm...

> See
> quirk_intel_brickland_xeon_ras_cap() and
> quirk_intel_purley_xeon_ras_cap() in arch/x86/kernel/quirks.c.
>
> >         use_mc_safe_version
> > else
> >         use_non_mc_safe_version
> > }
> >
> > Now question is, how do we know if virtiofs dax window is backed by
> > a pmem or not. I checked virtio_pmem driver and that does not seem
> > to communicate anything like that. It just communicates start of the
> > range and size of range, nothing else.
> >
> > I don't have full handle on stack of modules of virtio_pmem, but my guess
> > is it probably is using MC safe version always (because it does not
> > know anthing about the backing storage).
> >
> > /me will definitely like to pay penalty of slower memcpy if virtiofs
> > device is not backed by a pmem.
>
> I assume you meant "not like",

Yes. It was a typo.

> but again PMEM has no bearing on
> whether using that device will throw machine checks. I'm sure there
> are people that would make the opposite tradeoff.

Why does the pmem driver not have to make this trade-off, and instead
always use the machine check variant? As you mentioned, machine checks can
happen with DRAM too. So why does loading from the page cache not use the
machine check variant (or give the user an option to choose)?
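
To make the trade-off above concrete, here is a minimal user-space sketch,
not kernel code. It assumes the convention (as with the kernel's
copy_mc_to_kernel()) that an MC-safe copy reports the number of bytes it
could not copy. The names copy_policy, sim_copy_mc() and dax_read_sketch()
are hypothetical, and the "poisoned offset" parameter only simulates what a
real machine check would signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy knob; nothing like this exists in virtiofs today. */
enum copy_policy { COPY_FAST, COPY_MC_SAFE };

/* Marker meaning "no poisoned byte in the source buffer". */
#define NO_POISON ((size_t)-1)

/*
 * Stand-in for a machine-check-safe copy.
 * Returns the number of bytes NOT copied (0 means full success).
 * Real hardware reports poison via an exception; here the caller
 * passes the poisoned offset to simulate it.
 */
static size_t sim_copy_mc(void *dst, const void *src, size_t len,
                          size_t poison_off)
{
    assert(dst != NULL && src != NULL);
    size_t good = (poison_off < len) ? poison_off : len;

    memcpy(dst, src, good);   /* copy only the bytes before the poison */
    return len - good;
}

/*
 * Read path sketch. Returns 0 on success, -1 on failure.
 * COPY_MC_SAFE: poison becomes a recoverable error (think -EIO).
 * COPY_FAST:    poison cannot be recovered; in the kernel this would be
 *               a panic, modeled here by setting *would_panic.
 */
static int dax_read_sketch(enum copy_policy policy, void *dst,
                           const void *src, size_t len,
                           size_t poison_off, bool *would_panic)
{
    *would_panic = false;

    if (policy == COPY_MC_SAFE)
        return sim_copy_mc(dst, src, len, poison_off) ? -1 : 0;

    if (poison_off < len) {
        *would_panic = true;  /* unhandled machine check */
        return -1;
    }
    memcpy(dst, src, len);    /* plain, fastest copy */
    return 0;
}
```

The sketch leaves out the performance side entirely: on the platforms
Dan refers to, the MC-safe path would be the slower copy, which is the
cost being weighed against the panic.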

BTW, Stefan mentioned that we could think of adding a device feature bit
to signal whether to do MC safe memcpy() or not, if it becomes really
necessary.

For now, let us probably stick to the performance variant. If users demand
machine check handling, then either introduce it unconditionally or make
it an opt-in based on a device feature bit.

Thanks
Vivek