Re: [PATCH v4 0/3] nvdimm: Enable sync-dax property for nvdimm

Shivaprasad G Bhat <sbhat@xxxxxxxxxxxxx> · Mon, 3 May 2021 19:35:21 +0530

On 5/1/21 12:44 AM, Dan Williams wrote:
Some corrections to terminology confusion below...

On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@xxxxxxxxxxxxx> wrote:
The nvdimm devices are expected to ensure write persistence during power
failure kind of scenarios.
No, QEMU is not expected to make that guarantee. QEMU is free to lie
to the guest about the persistence guarantees of the guest PMEM
ranges. It's more accurate to say that QEMU nvdimm devices can emulate
persistent memory and optionally pass through host power-fail
persistence guarantees to the guest. The power-fail persistence domain
can be one of "cpu_cache", or "memory_controller" if the persistent
memory region is "synchronous". If the persistent range is not
synchronous, it really isn't "persistent memory"; it's memory mapped
storage that needs I/O commands to flush.

Since this is a virtual nvdimm (v-nvdimm) backed by a file, the data sits
entirely in the host pagecache, and we need a way to ensure that the host
pagecache is flushed to the backend. This is analogous to the WPQ flush
being offloaded to the hypervisor.

Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt

The libpmem has architecture specific instructions like dcbf on POWER
Which "libpmem" is this? PMDK is a reference library not a PMEM
interface... maybe I'm missing what libpmem has to do with QEMU?

I was referring to the semantics of flushing pmem cache lines as in
PMDK/libpmem.

to flush the cache data to backend nvdimm device during normal writes
followed by explicit flushes if the backend devices are not synchronous
DAX capable.

Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
and the subsequent flush doesn't traslate to actual flush to the backend
s/traslate/translate/

file on the host in case of file backed v-nvdimms. This is addressed by
virtio-pmem in case of x86_64 by making explicit flushes translating to
fsync at qemu.
Note that virtio-pmem was a proposal for a specific optimization of
allowing guests to share page cache. The virtio-pmem approach is not
to be confused with actual persistent memory.

On SPAPR, the issue is addressed by adding a new hcall to
request for an explicit flush from the guest ndctl driver when the backend
What is an "ndctl" driver? ndctl is userspace tooling, do you mean the
guest pmem driver?

Oops, wrong terminology. I was referring to the guest libnvdimm and
papr_scm kernel modules.

nvdimm cannot ensure write persistence with dcbf alone. So, the approach
here is to convey when the hcall flush is required in a device tree
property. The guest makes the hcall when the property is found, instead
of relying on dcbf.

A new device property sync-dax is added to the nvdimm device. When the
sync-dax is 'writeback'(default for PPC), device property
"hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH
requesting for an explicit flush.
I'm not sure "sync-dax" is a suitable name for the property of the
guest persistent memory.

The sync-dax property determines whether the ND_REGION_ASYNC flag is set
for the pmem region, and whether the nvdimm_flush callback is provided in
papr_scm. As everything boils down to the synchronous nature of the
device, I chose sync-dax for the name.

  There is no requirement that the
memory-backend file for a guest be a dax-capable file. It's also
implementation specific what hypercall needs to be invoked for a given
occurrence of "sync-dax". What does that map to on non-PPC platforms
for example?

The backend file can be dax-capable, which is hinted using
"sync-dax=direct". When the backend is not dax-capable,
"sync-dax=writeback" is to be used, so that the guest makes the hcall.
On all non-PPC archs, qemu errors out on "sync-dax=writeback", stating
the lack of support.

  It seems to me that an "nvdimm" device presents the
synchronous usage model and a whole other device type implements an
async-hypercall setup that the guest happens to service with its
nvdimm stack, but it's not an "nvdimm" anymore at that point.

In case the file backing the v-nvdimm is not dax-capable, we need the
flush semantics in the guest to be mapped to a pagecache flush on the
host side.
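
As a minimal sketch of what "pagecache flush on the host side" amounts to,
assuming the backend is an ordinary file descriptor: the helper name
flush_backing_file() is hypothetical and is not taken from the patches,
which offload the flush asynchronously rather than calling it inline.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): push the dirty host pagecache
 * pages of the backing file to stable storage with fsync().
 * Returns 0 on success or a negative errno value on failure.
 */
static int flush_backing_file(int backend_fd)
{
    if (fsync(backend_fd) < 0) {
        return -errno;
    }
    return 0;
}

/* Small self-check: an invalid descriptor is reported as -EBADF. */
static void flush_backing_file_selfcheck(void)
{
    assert(flush_backing_file(-1) == -EBADF);
}
```

The point is only that the guest-visible flush has to end in an fsync() of
the backing file; how the request reaches the host is the hcall's job.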

sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines
prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented
now as the flush semantics are unimplemented.
"sync-dax" has no meaning on its own, I think this needs an explicit
mechanism to convey both the "not-sync" property *and* the callback
method, it shouldn't be inferred by arch type.

Yes. On all platforms, "sync-dax=unsafe" means that on host power
failure the host pagecache is lost, and subsequently the data written by
the guest will also be gone. This is the default for non-PPC.

On PPC, the default is "sync-dax=writeback", so ND_REGION_ASYNC is set
for the region and the guest makes hcalls to issue fsync on the host.

Are you suggesting that I keep "unsafe" as the default for all
architectures including PPC, so that a user can set it to "writeback" if
desired?

When the backend file is actually synchronous DAX capable and no explicit
flushes are required, the sync-dax mode 'direct' is to be used.
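
The three modes can be summarised in a short sketch. The enum and function
names below are hypothetical and chosen for illustration, not the
identifiers used in the patches. The only behaviour encoded is the one
described above: 'writeback' is the mode that sets "hcall-flush-required".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names for the three sync-dax values (illustration only). */
typedef enum {
    SYNC_DAX_UNSAFE,     /* no flush guarantee; host pagecache may be lost */
    SYNC_DAX_WRITEBACK,  /* guest must request explicit flushes via hcall  */
    SYNC_DAX_DIRECT,     /* backend is synchronous DAX; no explicit flush  */
    SYNC_DAX_INVALID
} SyncDaxMode;

/* Map the property string to a mode; unknown strings are rejected. */
static SyncDaxMode sync_dax_parse(const char *value)
{
    if (strcmp(value, "unsafe") == 0) {
        return SYNC_DAX_UNSAFE;
    }
    if (strcmp(value, "writeback") == 0) {
        return SYNC_DAX_WRITEBACK;
    }
    if (strcmp(value, "direct") == 0) {
        return SYNC_DAX_DIRECT;
    }
    return SYNC_DAX_INVALID;
}

/* Only 'writeback' advertises "hcall-flush-required" to the guest. */
static bool hcall_flush_required(SyncDaxMode mode)
{
    assert(mode != SYNC_DAX_INVALID);
    return mode == SYNC_DAX_WRITEBACK;
}
```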

The below demonstration shows the map_sync behavior with sync-dax writeback &
direct.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm with sync-dax=direct, and pmem1 is from
nvdimm with sync-dax=writeback, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback
Failed to mmap  with Operation not supported
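
For readers without the linked test at hand, the check is roughly the
following. This is a sketch of the general MAP_SYNC probing technique, not
a copy of map_sync.c; the helper name try_map_sync() is hypothetical, and
the fallback flag values are the asm-generic Linux ones, used only when
the libc headers do not define them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers (asm-generic Linux values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: try to create a MAP_SYNC mapping of 'path'.
 * Returns 0 if the mapping succeeds, otherwise the errno of the
 * failing call (EOPNOTSUPP when the region is not synchronous DAX).
 */
static int try_map_sync(const char *path, size_t len)
{
    int fd, err = 0;
    void *addr;

    assert(path != NULL && len > 0);

    fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        return errno;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        err = errno;
        close(fd);
        return err;
    }

    /* MAP_SHARED_VALIDATE makes the kernel reject unsupported flags. */
    addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED) {
        err = errno;
    } else {
        munmap(addr, len);
    }
    close(fd);
    return err;
}
```

With the mounts above, one would expect try_map_sync("/mnt1/newfile", 4096)
to return 0 and the same call on /mnt2 to return EOPNOTSUPP, matching the
"Operation not supported" message.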

The first patch does the header file cleanup necessary for the
subsequent ones. Second patch implements the hcall, adds the necessary
vmstate properties to spapr machine structure for carrying the hcall
status during save-restore. The nature of the hcall being asynchronous,
the patch uses aio utilities to offload the flush. The third patch adds
the 'sync-dax' device property and enables the device tree property
for the guest to utilise the hcall.

The kernel changes to exploit this hcall are at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
       - Fixed the forward declaration coding guideline violations in 1st patch.
       - Removed the code waiting for the flushes to complete during migration,
         instead restart the flush worker on destination qemu in post load.
       - Got rid of the randomization of the flush tokens, using simple
         counter.
       - Got rid of the redundant flush state lock, relying on the BQL now.
       - Handling the memory-backend-ram usage
       - Changed the sync-dax semantics from on/off to 'unsafe','writeback' and 'direct'.
         Added prevention code using 'writeback' on arm and x86_64.
       - Fixed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
       - Using the thread pool based approach as suggested
       - Moved the async hcall handling code to spapr_nvdimm.c along
         with some simplifications
       - Added vmstate to preserve the hcall status during save-restore
         along with pre_save handler code to complete all ongoing flushes.
       - Added hw_compat magic for sync-dax 'on' on previous machines.
       - Miscellaneous minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
       - Fixed a missed-out unlock
       - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
       spapr: nvdimm: Forward declare and move the definitions
       spapr: nvdimm: Implement H_SCM_FLUSH hcall
       nvdimm: Enable sync-dax device property for nvdimm

  hw/arm/virt.c                 |   28 ++++
  hw/i386/pc.c                  |   28 ++++
  hw/mem/nvdimm.c               |   52 +++++++
  hw/ppc/spapr.c                |   16 ++
  hw/ppc/spapr_nvdimm.c         |  285 +++++++++++++++++++++++++++++++++++++++++
  include/hw/mem/nvdimm.h       |   11 ++
  include/hw/ppc/spapr.h        |   11 +-
  include/hw/ppc/spapr_nvdimm.h |   27 ++--
  qapi/common.json              |   20 +++
  9 files changed, 455 insertions(+), 23 deletions(-)

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@xxxxxxxxxxxx
To unsubscribe send an email to linux-nvdimm-leave@xxxxxxxxxxxx