On Fri, Oct 30, 2015 at 01:55:54PM +0800, Xiao Guangrong wrote:
> This patchset can be found at:
>       https://github.com/xiaogr/qemu.git nvdimm-v6
>
> It is based on pci branch on Michael's tree and the top commit is:
> commit 6f96a31a06c2a1 (tests: re-enable vhost-user-test).
>
> Changelog in v6:
> - changes from Stefan's comments:
>   1) fix code style of struct naming to use CamelCase
>   2) fix offset + length overflow when reading/writing label data
>   3) compile hw/acpi/nvdimm.c per target so that TARGET_PAGE_SIZE can
>      be used to replace getpagesize()
>
> Changelog in v5:
> - changes from Michael's comments:
>   1) prefix nvdimm_ to everything in NVDIMM source files
>   2) make parsing _DSM Arg3 clearer
>   3) comment style fix
>   5) drop single-use definition
>   6) fix loss of dirty dsm buffer caused by memory writes happening on
>      the host
>   7) check that the dsm buffer is big enough to contain the input data
>   8) use build_append_int_noprefix to store single value to GArray
>
> - changes from Michael's and Igor's comments:
>   1) introduce 'nvdimm-support' parameter to control nvdimm
>      enablement; it is disabled for 2.4 and earlier versions
>      to keep live migration compatible
>   2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>      virtualization
>
> - changes from Stefan's comments:
>   1) do endian adjustment for the buffer length
>
> - changes from Bharata B Rao's comments:
>   1) fix compile on ppc
>
> - others:
>   1) the buffer length is obtained directly from the IO read rather than
>      from dsm memory
>   2) fix loss of dirty label data caused by memory writes happening on
>      the host
>
> Changelog in v4:
> - changes from Michael's comments:
>   1) show the message, "Memory is not allocated from HugeTlbfs", if file
>      based memory is not allocated from hugetlbfs.
>   2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>      from Machine.
>   3) statically define UUID and make its operation clearer
>   4) use GArray to build device structures to avoid potential buffer
>      overflow
>   5) improve comments in the code
>   6) improve code style
>
> - changes from Igor's comments:
>   1) add NVDIMM ACPI spec document
>   2) use serialized method to avoid Mutex
>   3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>   4) introduce a common ASL method used by _DSM for all devices to reduce
>      ACPI size
>   5) handle UUID in ACPI AML code. BTW, I'd keep handling the revision in
>      QEMU, as it's better to upgrade QEMU to support Rev2 in the future
>
> - changes from Stefan's comments:
>   1) copy input data from DSM memory to local buffer to avoid potential
>      issues as DSM memory is visible to guest. Output data is handled
>      in a similar way
>
> - changes from Dan's comments:
>   1) drop static namespace as Linux already supports label-less
>      nvdimm devices
>
> - changes from Vladimir's comments:
>   1) print better message, "failed to get file size for %s, can't create
>      backend on it", if any file operation failed to obtain file size
>
> - others:
>   create a git repo on github.com for better review/test
>
> Also, thanks for Eric Blake's review on QAPI's side.
>
> Thank all of you for reviewing this patchset.
>
> Changelog in v3:
> There is a huge change in this version. Thanks to Igor, Stefan, Paolo,
> Eduardo and Michael for their valuable comments, the patchset is finally
> in better shape.
> - changes from Igor's comments:
>   1) abstract dimm device type from pc-dimm and create nvdimm device based on
>      dimm, then it uses memory backend device as nvdimm's memory and NUMA is
>      easily implemented.
>   2) let file-backend device support any kind of filesystem, not only
>      hugetlbfs, and let it work on a file, not only on a directory. This is
>      achieved by extending 'mem-path': if it's a directory then it works as
>      the current behavior, otherwise if it's a file then memory is allocated
>      directly from it.
>   3) we figured out an unused memory hole below 4G that is 0xFF00000 ~
>      0xFFF00000; this range is large enough for NVDIMM ACPI, as building a
>      64-bit ACPI SSDT/DSDT table would break Windows XP.
>      BTW, only setting SSDT.rev = 2 does not work since the integer width
>      depends only on DSDT.rev, per 19.6.28 DefinitionBlock (Declare
>      Definition Block) in the ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets
> | the global integer width for all integers, including integers in SSDTs.
>   4) use the lowest ACPI spec version to document AML terms.
>   5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
>
> - changes from Stefan's comments:
>   1) do not do endian adjustment in-place since _DSM memory is visible to guest
>   2) use target platform's target page size instead of fixed PAGE_SIZE
>      definition
>   3) lots of code style improvements and typo fixes.
>   4) live migration fix
> - changes from Paolo's comments:
>   1) improve the name of memory region
>
> - other changes:
>   1) return exact buffer size for _DSM method instead of the page size.
>   2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
>      devices.
>   3) NUMA support
>   4) implement _FIT method
>   5) rename "configdata" to "reserve-label-data"
>   6) simplify _DSM arg3 determination
>   7) main changelog update to let it reflect v3.
>
> Changelog in v2:
> - Use little endian for DSM method, thanks for Stefan's suggestion
>
> - introduce a new parameter, @configdata. If it's false, QEMU will
>   build a static and readonly namespace in memory and use it to serve
>   DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
>   reserved region is needed at the end of the @file, which is good for
>   users who want to pass the whole nvdimm device and make its data
>   completely visible to the guest
>
> - divide the source code into separate files and add maintain info
>
> BTW, PCOMMIT virtualization on the KVM side is work in progress and will
> hopefully be posted next week
>
> ====== Background ======
> NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported
> on Intel's platform. They are discovered via ACPI and configured by the _DSM
> method of the NVDIMM device in ACPI. There are some supporting documents which
> can be found at:
> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
> Currently, the NVDIMM driver has been merged into the upstream Linux kernel and
> this patchset tries to enable it in the virtualization field
>
> ====== Design ======
> NVDIMM supports two access modes: one is PMEM, which maps NVDIMM into the CPU's
> address space so the CPU can directly access it as normal memory; the other is
> BLK, which is used as a block device to reduce the amount of CPU address
> space occupied
>
> BLK mode accesses NVDIMM via Command Register window and Data Register window.
> BLK virtualization has a high overhead since each sector access will cause at
> least two VM-EXITs. So we currently only implement vPMEM in this patchset
>
> --- vPMEM design ---
> We introduce a new device named "nvdimm"; it uses a memory backend device as
> NVDIMM memory. The file in the file-backend device can be a regular file or a
> block device. We can use any file when we do test or emulation; however,
> in the real world, the files passed to the guest are:
> - the regular file in the filesystem with DAX enabled created on NVDIMM device
>   on host
> - the raw PMEM device on host, e.g. /dev/pmem0
> Memory access on the address created by mmap on these kinds of files can
> directly reach the NVDIMM device on the host.
>
> --- vConfigure data area design ---
> Each NVDIMM device has a configure data area which is used to store label
> namespace data. In order to emulate this area, we divide the file into two
> parts:
> - the first part is [0, size - 128K), which is used as PMEM
> - 128K at the end of the file, which is used as Label Data Area
> So the label namespace data is persistent across power loss or system
> failure.
>
> We also support passing the whole file to the guest without reserving any
> region for the label data area, which is achieved by the "reserve-label-data"
> parameter: if it's false then QEMU will build a static and readonly namespace
> in memory and that namespace contains the whole file size. The parameter is
> false by default.
>
> --- _DSM method design ---
> _DSM in ACPI is used to configure NVDIMM. Currently we only allow access to
> label namespace data, i.e. Get Namespace Label Size (Function Index 4),
> Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
> (Function Index 6)
>
> _DSM uses two pages to transfer data between ACPI and QEMU. The first page
> is RAM-based and is used to save the input info of the _DSM method; QEMU
> reuses it to store the output info. The other page is MMIO-based; ACPI writes
> data to this page to transfer control to QEMU
>
> ====== Test ======
> In host
> 1) create memory backed file, e.g. # dd if=/dev/zero of=/tmp/nvdimm bs=1G count=10
> 2) append "-object memory-backend-file,share,id=mem1,
>    mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data,
>    id=nv1" in QEMU command line
>
> In guest, download the latest upstream kernel (4.2 merge window) and enable
> ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
> 1) insmod drivers/nvdimm/libnvdimm.ko
> 2) insmod drivers/acpi/nfit.ko
> 3) insmod drivers/nvdimm/nd_btt.ko
> 4) insmod drivers/nvdimm/nd_pmem.ko
> You can see the whole nvdimm device used as a single namespace and /dev/pmem0
> appears. You can do whatever you like on /dev/pmem0, including DAX access.
>
> Currently the Linux NVDIMM driver does not support namespace operations on this
> kind of PMEM; apply the changes below to support dynamic namespaces:
>
> @@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
>                         continue;
>                 }
>
> -               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               if (nfit_mem->memdev_pmem)
>                         flags |= NDD_ALIASING;
>
> You can append another NVDIMM device in guest and do:
> # cd /sys/bus/nd/devices/
> # cd namespace1.0/
> # echo `uuidgen` > uuid
> # echo `expr 1024 \* 1024 \* 128` > size
> then reload nd_pmem.ko
>
> You can see /dev/pmem1 appears
>
> Xiao Guangrong (33):
>   acpi: add aml_derefof
>   acpi: add aml_sizeof
>   acpi: add aml_create_field
>   acpi: add aml_concatenate
>   acpi: add aml_object_type
>   acpi: add aml_method_serialized
>   util: introduce qemu_file_get_page_size()
>   exec: allow memory to be allocated from any kind of path
>   exec: allow file_ram_alloc to work on file
>   hostmem-file: clean up memory allocation
>   hostmem-file: use whole file size if possible
>   pc-dimm: remove DEFAULT_PC_DIMMSIZE
>   pc-dimm: make pc_existing_dimms_capacity static and rename it
>   pc-dimm: drop the prefix of pc-dimm
>   stubs: rename qmp_pc_dimm_device_list.c
>   pc-dimm: rename pc-dimm.c and pc-dimm.h
>   dimm: abstract dimm device from pc-dimm
>   dimm: get mapped memory region from DIMMDeviceClass->get_memory_region
>   dimm: keep the state of the whole backend memory
>   dimm: introduce realize callback
>   nvdimm: implement NVDIMM device abstract
>   docs: add NVDIMM ACPI documentation
>   nvdimm acpi: init the resource used by NVDIMM ACPI
>   nvdimm acpi: build ACPI NFIT table
>   nvdimm acpi: build ACPI nvdimm devices
>   nvdimm acpi: save arg3 for NVDIMM device _DSM method
>   nvdimm acpi: support function 0
>   nvdimm acpi: support Get Namespace Label Size function
>   nvdimm acpi: support Get Namespace Label Data function
>   nvdimm acpi: support Set Namespace Label Data function
>   nvdimm: allow using whole backend memory as pmem
>   nvdimm acpi: support _FIT method
>   nvdimm: add maintain info
>
>  MAINTAINERS                                        |    7 +
>  backends/hostmem-file.c                            |   59 +-
>  default-configs/i386-softmmu.mak                   |    3 +
>  default-configs/mips-softmmu.mak                   |    1 +
>  default-configs/mips64-softmmu.mak                 |    1 +
>  default-configs/mips64el-softmmu.mak               |    1 +
>  default-configs/mipsel-softmmu.mak                 |    1 +
>  default-configs/ppc64-softmmu.mak                  |    1 +
>  default-configs/x86_64-softmmu.mak                 |    3 +
>  docs/specs/acpi_nvdimm.txt                         |  179 ++++
>  exec.c                                             |  114 +-
>  hmp.c                                              |    2 +-
>  hw/Makefile.objs                                   |    2 +-
>  hw/acpi/Makefile.objs                              |    1 +
>  hw/acpi/aml-build.c                                |   79 +-
>  hw/acpi/ich9.c                                     |   32 +-
>  hw/acpi/memory_hotplug.c                           |   26 +-
>  hw/acpi/nvdimm.c                                   | 1132 ++++++++++++++++++++
>  hw/acpi/piix4.c                                    |   35 +-
>  hw/i386/acpi-build.c                               |    6 +
>  hw/i386/pc.c                                       |   34 +-
>  hw/mem/Makefile.objs                               |    2 +
>  hw/mem/{pc-dimm.c => dimm.c}                       |  240 +++--
>  hw/mem/nvdimm.c                                    |  142 +++
>  hw/mem/pc-dimm.c                                   |  510 +--------
>  hw/ppc/spapr.c                                     |   20 +-
>  include/hw/acpi/aml-build.h                        |    7 +
>  include/hw/acpi/ich9.h                             |    3 +
>  include/hw/i386/pc.h                               |   12 +-
>  include/hw/mem/dimm.h                              |   95 ++
>  include/hw/mem/nvdimm.h                            |  133 +++
>  include/hw/mem/pc-dimm.h                           |  104 +-
>  include/hw/ppc/spapr.h                             |    2 +-
>  include/qemu/osdep.h                               |    1 +
>  numa.c                                             |    4 +-
>  qapi-schema.json                                   |    8 +-
>  qmp.c                                              |    4 +-
>  stubs/Makefile.objs                                |    2 +-
>  ...c_dimm_device_list.c => qmp_dimm_device_list.c} |    4 +-
>  target-ppc/kvm.c                                   |   21 +-
>  trace-events                                       |    8 +-
>  util/oslib-posix.c                                 |   16 +
>  util/oslib-win32.c                                 |    5 +
>  43 files changed, 2224 insertions(+), 838 deletions(-)
>  create mode 100644 docs/specs/acpi_nvdimm.txt
>  create mode 100644 hw/acpi/nvdimm.c
>  rename hw/mem/{pc-dimm.c => dimm.c} (65%)
>  create mode 100644 hw/mem/nvdimm.c
>  rewrite hw/mem/pc-dimm.c (91%)
>  create mode 100644 include/hw/mem/dimm.h
>  create mode 100644 include/hw/mem/nvdimm.h
>  rewrite include/hw/mem/pc-dimm.h (97%)
>  rename stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} (56%)

I've reviewed the interface that ACPI inside the guest uses to communicate with
QEMU. I haven't reviewed the actual ACPI generation or pc-dimm device model
parts.

Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>