Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Haozhong Zhang <haozhong.zhang@xxxxxxxxx> · Thu, 1 Feb 2018 18:17:44 +0800

On 01/31/18 19:02 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zhang@xxxxxxxxx> wrote:
> > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> >
> > On 01/31/18 16:32 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> >> <haozhong.zhang@xxxxxxxxx> wrote:
> >> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> >> <haozhong.zhang@xxxxxxxxx> wrote:
> >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> >> <haozhong.zhang@xxxxxxxxx> wrote:
> >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >> >>
> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> >> metadata is in sync after a fault. However, that does not make
> >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> >> coordinate DMA with filesystem operations. There is no way to do that
> >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> >> earlier.
> >> >> >
> >> >> > QEMU does not truncate or punch holes in the file once it has been
> >> >> > mmap()'ed. Does the problem [1] still exist in that case?
> >> >>
> >> >> Something else on the system might. The only agent that could enforce
> >> >> protection is the kernel, and the kernel will likely just disallow
> >> >> passing addresses from filesystem-dax vmas through to a guest
> >> >> altogether. I think there's even a problem in the non-DAX case unless
> >> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> >> is that we don't have a page cache page to pin in the DAX case.
> >> >>
> >> >
> >> > Does it mean any user-space code like
> >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >> >   // make DMA to ptr
> >> > is unsafe?
> >>
> >> Yes, it is currently unsafe because there is no coordination with the
> >> filesystem if it decides to make block layout changes. We can fix that
> >> in the non-virtualization case by having the filesystem wait for DMA
> >> completion callbacks (i.e. wait for all pages to be idle), but as far
> >> as I can see we can't do the same coordination for DMA initiated by a
> >> guest device driver.
> >>
> >
> > I think that fix [1] also works for KVM/QEMU. Guest DMA is
> > performed on two types of devices:
> >
> > 1. For emulated devices, the guest DMA requests are trapped and
> >    actually performed by QEMU on the host side. The host side fix [1]
> >    can cover this case.
> >
> > 2. For passthrough devices, vfio pins all pages, including those
> >    backed by dax mode files, used by the guest if any device is
> >    passed through to it. If I read the commit message in [2] correctly,
> >    operations that change the page-to-file offset association of pages
> >    from dax mode files will be deferred until the reference count of
> >    the affected pages becomes 1.  That is, if any passthrough device
> >    is used with a VM, the changes of page-to-file offset will not be
> >    able to happen until the VM is shut down, so the fix [1] still takes
> >    effect here.
> 
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(); it is a problem if pages
> are pinned indefinitely, especially DAX pages. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near term it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.

It seems that KVM and VFIO need to switch to get_user_pages_longterm(),
which fails for pages backed by DAX-mode files.

However, get_user_pages() and its variants in the current KVM and VFIO
code may be called after a VM starts running, e.g., when handling an
EPT violation on demand or when hotplugging a passthrough device into
the VM, so simply switching to the longterm version would crash the VM
in those cases. Therefore, QEMU also needs to be patched, or at least
documented, so that DAX files are not used with memory-backend-file.
Paolo, Radim and Alex, what do you think?
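
As a rough illustration of what a QEMU-side check could look like, the
sketch below probes whether a file accepts a MAP_SYNC mapping, which
only files supporting DAX do. It is only a sketch under assumptions:
probe_map_sync() is a hypothetical helper name, not an existing QEMU
function and not part of this patch set, and the fallback flag values
are the ones used on x86 and most other architectures. A successful
probe says the file is DAX-backed; it says nothing about whether DMA to
the mapping is safe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for libc headers older than Linux 4.15 (asm-generic values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper (name is illustrative only).
 * Returns 1 if fd can be mapped with MAP_SYNC (i.e. it supports DAX),
 *         0 if the kernel or the file rejects MAP_SYNC,
 *        -1 on any other mmap() error (errno is left set).
 */
static int probe_map_sync(int fd, size_t len)
{
    assert(len > 0);

    /*
     * MAP_SHARED_VALIDATE makes the kernel reject unsupported flags
     * instead of silently ignoring them, so MAP_SYNC cannot be dropped
     * without us noticing.
     */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) {
        /*
         * EOPNOTSUPP: the file does not support MAP_SYNC (not DAX).
         * EINVAL: kernels before 4.15 do not know MAP_SHARED_VALIDATE.
         */
        return (errno == EOPNOTSUPP || errno == EINVAL) ? 0 : -1;
    }

    munmap(p, len);
    return 1;
}
```

A caller could use the result to refuse, or at least warn about, a
DAX-backed mem-path when a passthrough device may be attached later.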

Thanks,
Haozhong

> 
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
> 
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.
>