Re: [PATCH v6 00/33] implement vNVDIMM

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Oct 30, 2015 at 01:55:54PM +0800, Xiao Guangrong wrote:
> This patchset can be found at:
>       https://github.com/xiaogr/qemu.git nvdimm-v6
> 
> It is based on pci branch on Michael's tree and the top commit is:
> commit 6f96a31a06c2a1 (tests: re-enable vhost-user-test).
> 
> Changelog in v6:
> - changes from Stefan's comments:
>   1) fix code style of struct naming by CamelCase way
>   2) fix offset + length overflow when read/write label data
>   3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
>      be used to replace getpagesize()
> 
> Changelog in v5:
> - changes from Michael's comments:
>   1) prefix nvdimm_ to everything in NVDIMM source files
>   2) make parsing _DSM Arg3 more clear
>   3) comment style fix
>   5) drop single used definition
>   6) fix dirty dsm buffer lost due to memory write happened on host
>   7) check dsm buffer if it is big enough to contain input data
>   8) use build_append_int_noprefix to store single value to GArray
> 
> - changes from Michael's and Igor's comments:
>   1) introduce 'nvdimm-support' parameter to control nvdimm
>      enablement and it is disabled for 2.4 and its earlier versions
>      to make live migration compatible
>   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>      virtualization
> 
> - changes from Stefan's comments:
>   1) do endian adjustment for the buffer length
> 
> - changes from Bharata B Rao's comments:
>   1) fix compile on ppc
> 
> - others:
>   1) the buffer length is directly got from IO read rather than got
>      from dsm memory
>   2) fix dirty label data lost due to memory write happened on host
> 
> Changelog in v4:
> - changes from Michael's comments:
>   1) show the message, "Memory is not allocated from HugeTlbfs", if file
>      based memory is not allocated from hugetlbfs.
>   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>      from Machine.
>   3) statically define UUID and make its operation more clear
>   4) use GArray to build device structures to avoid potential buffer
>      overflow
>   4) improve comments in the code
>   5) improve code style
> 
> - changes from Igor's comments:
>   1) add NVDIMM ACPI spec document
>   2) use serialized method to avoid Mutex
>   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>   4) introduce a common ASL method used by _DSM for all devices to reduce
>      ACPI size
>   5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU
>      it's better to upgrade QEMU to support Rev2 in the future
> 
> - changes from Stefan's comments:
>   1) copy input data from DSM memory to local buffer to avoid potential
>      issues as DSM memory is visible to guest. Output data is handled
>      in a similar way
> 
> - changes from Dan's comments:
>   1) drop static namespace as Linux has already supported label-less
>      nvdimm devices
> 
> - changes from Vladimir's comments:
>   1) print better message, "failed to get file size for %s, can't create
>      backend on it", if any file operation filed to obtain file size
> 
> - others:
>   create a git repo on github.com for better review/test
> 
> Also, thanks for Eric Blake's review on QAPI's side.
> 
> Thank all of you to review this patchset.
> 
> Changelog in v3:
> There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
> Michael for their valuable comments, the patchset finally gets better shape.
> - changes from Igor's comments:
>   1) abstract dimm device type from pc-dimm and create nvdimm device based on
>      dimm, then it uses memory backend device as nvdimm's memory and NUMA has
>      easily been implemented.
>   2) let file-backend device support any kind of filesystem not only for
>      hugetlbfs and let it work on file not only for directory which is
>      achieved by extending 'mem-path' - if it's a directory then it works as
>      current behavior, otherwise if it's file then directly allocates memory
>      from it.
>   3) we figure out a unused memory hole below 4G that is 0xFF00000 ~ 
>      0xFFF00000, this range is large enough for NVDIMM ACPI as build 64-bit
>      ACPI SSDT/DSDT table will break windows XP.
>      BTW, only make SSDT.rev = 2 can not work since the width is only depended
>      on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
>      in ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit 
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32 
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets 
> | the global integer width for all integers, including integers in SSDTs.
>   4) use the lowest ACPI spec version to document AML terms.
>   5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
> 
> - changes from Stefan's comments:
>   1) do not do endian adjustment in-place since _DSM memory is visible to guest
>   2) use target platform's target page size instead of fixed PAGE_SIZE
>      definition
>   3) lots of code style improvement and typo fixes.
>   4) live migration fix
> - changes from Paolo's comments:
>   1) improve the name of memory region
>   
> - other changes:
>   1) return exact buffer size for _DSM method instead of the page size.
>   2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
>      devices.
>   3) NUMA support
>   4) implement _FIT method
>   5) rename "configdata" to "reserve-label-data"
>   6) simplify _DSM arg3 determination
>   7) main changelog update to let it reflect v3.
> 
> Changlog in v2:
> - Use litten endian for DSM method, thanks for Stefan's suggestion
> 
> - introduce a new parameter, @configdata, if it's false, Qemu will
>   build a static and readonly namespace in memory and use it serveing
>   for DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
>   reserved region is needed at the end of the @file, it is good for
>   the user who want to pass whole nvdimm device and make its data
>   completely be visible to guest
> 
> - divide the source code into separated files and add maintain info
> 
> BTW, PCOMMIT virtualization on KVM side is work in progress, hopefully will
> be posted on next week
> 
> ====== Background ======
> NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported
> on Intel's platform. They are discovered via ACPI and configured by _DSM
> method of NVDIMM device in ACPI. There has some supporting documents which
> can be found at:
> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> 
> Currently, the NVDIMM driver has been merged into upstream Linux Kernel and
> this patchset tries to enable it in virtualization field
> 
> ====== Design ======
> NVDIMM supports two mode accesses, one is PMEM which maps NVDIMM into CPU's
> address space then CPU can directly access it as normal memory, another is
> BLK which is used as block device to reduce the occupying of CPU address
> space
> 
> BLK mode accesses NVDIMM via Command Register window and Data Register window.
> BLK virtualization has high workload since each sector access will cause at
> least two VM-EXIT. So we currently only imperilment vPMEM in this patchset
> 
> --- vPMEM design ---
> We introduce a new device named "nvdimm", it uses memory backend device as
> NVDIMM memory. The file in file-backend device can be a regular file and block 
> device. We can use any file when we do test or emulation, however,
> in the real word, the files passed to guest are:
> - the regular file in the filesystem with DAX enabled created on NVDIMM device
>   on host
> - the raw PMEM device on host, e,g /dev/pmem0
> Memory access on the address created by mmap on these kinds of files can
> directly reach NVDIMM device on host.
> 
> --- vConfigure data area design ---
> Each NVDIMM device has a configure data area which is used to store label
> namespace data. In order to emulating this area, we divide the file into two
> parts:
> - first parts is (0, size - 128K], which is used as PMEM
> - 128K at the end of the file, which is used as Label Data Area
> So that the label namespace data can be persistent during power lose or system
> failure.
> 
> We also support passing the whole file to guest without reserve any region for
> label data area which is achieved by "reserve-label-data" parameter - if it's
> false then QEMU will build static and readonly namespace in memory and that
> namespace contains the whole file size. The parameter is false on default.
> 
> --- _DSM method design ---
> _DSM in ACPI is used to configure NVDIMM, currently we only allow access of
> label namespace data, i.e, Get Namespace Label Size (Function Index 4),
> Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
> (Function Index 6)
> 
> _DSM uses two pages to transfer data between ACPI and Qemu, the first page
> is RAM-based used to save the input info of _DSM method and Qemu reuse it
> store output info and another page is MMIO-based, ACPI write data to this
> page to transfer the control to Qemu
> 
> ====== Test ======
> In host
> 1) create memory backed file, e.g # dd if=zero of=/tmp/nvdimm bs=1G count=10
> 2) append "-object memory-backend-file,share,id=mem1,
>    mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data,
>    id=nv1" in QEMU command line
> 
> In guest, download the latest upsteam kernel (4.2 merge window) and enable
> ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
> 1) insmod drivers/nvdimm/libnvdimm.ko
> 2) insmod drivers/acpi/nfit.ko
> 3) insmod drivers/nvdimm/nd_btt.ko
> 4) insmod drivers/nvdimm/nd_pmem.ko
> You can see the whole nvdimm device used as a single namespace and /dev/pmem0
> appears. You can do whatever on /dev/pmem0 including DAX access.
> 
> Currently Linux NVDIMM driver does not support namespace operation on this
> kind of PMEM, apply below changes to support dynamical namespace:
> 
> @@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
>                         continue;
>                 }
>  
> -               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               if (nfit_mem->memdev_pmem)
>                         flags |= NDD_ALIASING;
> 
> You can append another NVDIMM device in guest and do:                       
> # cd /sys/bus/nd/devices/
> # cd namespace1.0/
> # echo `uuidgen` > uuid
> # echo `expr 1024 \* 1024 \* 128` > size
> then reload nd.pmem.ko
> 
> You can see /dev/pmem1 appears
> 
> Xiao Guangrong (33):
>   acpi: add aml_derefof
>   acpi: add aml_sizeof
>   acpi: add aml_create_field
>   acpi: add aml_concatenate
>   acpi: add aml_object_type
>   acpi: add aml_method_serialized
>   util: introduce qemu_file_get_page_size()
>   exec: allow memory to be allocated from any kind of path
>   exec: allow file_ram_alloc to work on file
>   hostmem-file: clean up memory allocation
>   hostmem-file: use whole file size if possible
>   pc-dimm: remove DEFAULT_PC_DIMMSIZE
>   pc-dimm: make pc_existing_dimms_capacity static and rename it
>   pc-dimm: drop the prefix of pc-dimm
>   stubs: rename qmp_pc_dimm_device_list.c
>   pc-dimm: rename pc-dimm.c and pc-dimm.h
>   dimm: abstract dimm device from pc-dimm
>   dimm: get mapped memory region from DIMMDeviceClass->get_memory_region
>   dimm: keep the state of the whole backend memory
>   dimm: introduce realize callback
>   nvdimm: implement NVDIMM device abstract
>   docs: add NVDIMM ACPI documentation
>   nvdimm acpi: init the resource used by NVDIMM ACPI
>   nvdimm acpi: build ACPI NFIT table
>   nvdimm acpi: build ACPI nvdimm devices
>   nvdimm acpi: save arg3 for NVDIMM device _DSM method
>   nvdimm acpi: support function 0
>   nvdimm acpi: support Get Namespace Label Size function
>   nvdimm acpi: support Get Namespace Label Data function
>   nvdimm acpi: support Set Namespace Label Data function
>   nvdimm: allow using whole backend memory as pmem
>   nvdimm acpi: support _FIT method
>   nvdimm: add maintain info
> 
>  MAINTAINERS                                        |    7 +
>  backends/hostmem-file.c                            |   59 +-
>  default-configs/i386-softmmu.mak                   |    3 +
>  default-configs/mips-softmmu.mak                   |    1 +
>  default-configs/mips64-softmmu.mak                 |    1 +
>  default-configs/mips64el-softmmu.mak               |    1 +
>  default-configs/mipsel-softmmu.mak                 |    1 +
>  default-configs/ppc64-softmmu.mak                  |    1 +
>  default-configs/x86_64-softmmu.mak                 |    3 +
>  docs/specs/acpi_nvdimm.txt                         |  179 ++++
>  exec.c                                             |  114 +-
>  hmp.c                                              |    2 +-
>  hw/Makefile.objs                                   |    2 +-
>  hw/acpi/Makefile.objs                              |    1 +
>  hw/acpi/aml-build.c                                |   79 +-
>  hw/acpi/ich9.c                                     |   32 +-
>  hw/acpi/memory_hotplug.c                           |   26 +-
>  hw/acpi/nvdimm.c                                   | 1132 ++++++++++++++++++++
>  hw/acpi/piix4.c                                    |   35 +-
>  hw/i386/acpi-build.c                               |    6 +
>  hw/i386/pc.c                                       |   34 +-
>  hw/mem/Makefile.objs                               |    2 +
>  hw/mem/{pc-dimm.c => dimm.c}                       |  240 +++--
>  hw/mem/nvdimm.c                                    |  142 +++
>  hw/mem/pc-dimm.c                                   |  510 +--------
>  hw/ppc/spapr.c                                     |   20 +-
>  include/hw/acpi/aml-build.h                        |    7 +
>  include/hw/acpi/ich9.h                             |    3 +
>  include/hw/i386/pc.h                               |   12 +-
>  include/hw/mem/dimm.h                              |   95 ++
>  include/hw/mem/nvdimm.h                            |  133 +++
>  include/hw/mem/pc-dimm.h                           |  104 +-
>  include/hw/ppc/spapr.h                             |    2 +-
>  include/qemu/osdep.h                               |    1 +
>  numa.c                                             |    4 +-
>  qapi-schema.json                                   |    8 +-
>  qmp.c                                              |    4 +-
>  stubs/Makefile.objs                                |    2 +-
>  ...c_dimm_device_list.c => qmp_dimm_device_list.c} |    4 +-
>  target-ppc/kvm.c                                   |   21 +-
>  trace-events                                       |    8 +-
>  util/oslib-posix.c                                 |   16 +
>  util/oslib-win32.c                                 |    5 +
>  43 files changed, 2224 insertions(+), 838 deletions(-)
>  create mode 100644 docs/specs/acpi_nvdimm.txt
>  create mode 100644 hw/acpi/nvdimm.c
>  rename hw/mem/{pc-dimm.c => dimm.c} (65%)
>  create mode 100644 hw/mem/nvdimm.c
>  rewrite hw/mem/pc-dimm.c (91%)
>  create mode 100644 include/hw/mem/dimm.h
>  create mode 100644 include/hw/mem/nvdimm.h
>  rewrite include/hw/mem/pc-dimm.h (97%)
>  rename stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} (56%)

I've reviewed the interface that ACPI inside the guest uses to
communicate with QEMU.  I haven't reviewed the actual ACPI generation or
pc-dimm device model parts.

Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux