On Mon, Nov 16, 2015 at 06:50:58PM +0800, Xiao Guangrong wrote: > This patchset can be found at: > https://github.com/xiaogr/qemu.git nvdimm-v8 > > It is based on pci branch on Michael's tree and the top commit is: > commit e3a4e177d9 (migration/ram: fix build on 32 bit hosts). > > Changelog in v8: > We split the long patch series into the small parts, as you see now, this > is the first part which enables NVDIMM without label data support. > > The command line has been changed because some patches simplifying the > things have not been included into this series, you should specify the > file size exactly using the parameters as follows: > memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \ > -device nvdimm,memdev=mem1,id=nv1 > > Changelog in v7: > - changes from Vladimir Sementsov-Ogievskiy's comments: > 1) let gethugepagesize() realize if fstat is failed instead of get > normal page size > 2) rename open_file_path to open_ram_file_path > 3) better log the error message by using error_setg_errno > 4) update commit in the commit log to explain hugepage detection on > Windows > > - changes from Eduardo Habkost's comments: > 1) use 'Error**' to collect error message for qemu_file_get_page_size() > 2) move gethugepagesize() replacement to the same patch to make it > better for review > 3) introduce qemu_get_file_size to unity the code with raw_getlength() > > - changes from Stefan's comments: > 1) check the memory region is large enough to contain DSM output > buffer > > - changes from Eric Blake's comments: > 1) update the shell command in the commit log to generate the patch > which drops 'pc-dimm' prefix > > - others: > pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and > Eric Blake. > > Changelog in v6: > - changes from Stefan's comments: > 1) fix code style of struct naming by CamelCase way > 2) fix offset + length overflow when read/write label data > 3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can > be used to replace getpagesize() > > Changelog in v5: > - changes from Michael's comments: > 1) prefix nvdimm_ to everything in NVDIMM source files > 2) make parsing _DSM Arg3 more clear > 3) comment style fix > 5) drop single used definition > 6) fix dirty dsm buffer lost due to memory write happened on host > 7) check dsm buffer if it is big enough to contain input data > 8) use build_append_int_noprefix to store single value to GArray > > - changes from Michael's and Igor's comments: > 1) introduce 'nvdimm-support' parameter to control nvdimm > enablement and it is disabled for 2.4 and its earlier versions > to make live migration compatible > 2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI > virtualization > > - changes from Stefan's comments: > 1) do endian adjustment for the buffer length > > - changes from Bharata B Rao's comments: > 1) fix compile on ppc > > - others: > 1) the buffer length is directly got from IO read rather than got > from dsm memory > 2) fix dirty label data lost due to memory write happened on host > > Changelog in v4: > - changes from Michael's comments: > 1) show the message, "Memory is not allocated from HugeTlbfs", if file > based memory is not allocated from hugetlbfs. > 2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState > from Machine. > 3) statically define UUID and make its operation more clear > 4) use GArray to build device structures to avoid potential buffer > overflow > 4) improve comments in the code > 5) improve code style > > - changes from Igor's comments: > 1) add NVDIMM ACPI spec document > 2) use serialized method to avoid Mutex > 3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c > 4) introduce a common ASL method used by _DSM for all devices to reduce > ACPI size > 5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU > it's better to upgrade QEMU to support Rev2 in the future > > - changes from Stefan's comments: > 1) copy input data from DSM memory to local buffer to avoid potential > issues as DSM memory is visible to guest. Output data is handled > in a similar way > > - changes from Dan's comments: > 1) drop static namespace as Linux has already supported label-less > nvdimm devices > > - changes from Vladimir's comments: > 1) print better message, "failed to get file size for %s, can't create > backend on it", if any file operation filed to obtain file size > > - others: > create a git repo on github.com for better review/test > > Also, thanks for Eric Blake's review on QAPI's side. > > Thank all of you to review this patchset. > > Changelog in v3: > There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo, > Michael for their valuable comments, the patchset finally gets better shape. > - changes from Igor's comments: > 1) abstract dimm device type from pc-dimm and create nvdimm device based on > dimm, then it uses memory backend device as nvdimm's memory and NUMA has > easily been implemented. > 2) let file-backend device support any kind of filesystem not only for > hugetlbfs and let it work on file not only for directory which is > achieved by extending 'mem-path' - if it's a directory then it works as > current behavior, otherwise if it's file then directly allocates memory > from it. > 3) we figure out a unused memory hole below 4G that is 0xFF00000 ~ > 0xFFF00000, this range is large enough for NVDIMM ACPI as build 64-bit > ACPI SSDT/DSDT table will break windows XP. > BTW, only make SSDT.rev = 2 can not work since the width is only depended > on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block) > in ACPI spec: > | Note: For compatibility with ACPI versions before ACPI 2.0, the bit > | width of Integer objects is dependent on the ComplianceRevision of the DSDT. > | If the ComplianceRevision is less than 2, all integers are restricted to 32 > | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets > | the global integer width for all integers, including integers in SSDTs. > 4) use the lowest ACPI spec version to document AML terms. > 5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm" > > - changes from Stefan's comments: > 1) do not do endian adjustment in-place since _DSM memory is visible to guest > 2) use target platform's target page size instead of fixed PAGE_SIZE > definition > 3) lots of code style improvement and typo fixes. > 4) live migration fix > - changes from Paolo's comments: > 1) improve the name of memory region > > - other changes: > 1) return exact buffer size for _DSM method instead of the page size. > 2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm > devices. > 3) NUMA support > 4) implement _FIT method > 5) rename "configdata" to "reserve-label-data" > 6) simplify _DSM arg3 determination > 7) main changelog update to let it reflect v3. > > Changlog in v2: > - Use litten endian for DSM method, thanks for Stefan's suggestion > > - introduce a new parameter, @configdata, if it's false, Qemu will > build a static and readonly namespace in memory and use it serveing > for DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no > reserved region is needed at the end of the @file, it is good for > the user who want to pass whole nvdimm device and make its data > completely be visible to guest > > - divide the source code into separated files and add maintain info > > BTW, PCOMMIT virtualization on KVM side is work in progress, hopefully will > be posted on next week > > ====== Background ====== > NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported > on Intel's platform. They are discovered via ACPI and configured by _DSM > method of NVDIMM device in ACPI. There has some supporting documents which > can be found at: > ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf > NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf > DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf > Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf > > Currently, the NVDIMM driver has been merged into upstream Linux Kernel and > this patchset tries to enable it in virtualization field > > ====== Design ====== > NVDIMM supports two mode accesses, one is PMEM which maps NVDIMM into CPU's > address space then CPU can directly access it as normal memory, another is > BLK which is used as block device to reduce the occupying of CPU address > space > > BLK mode accesses NVDIMM via Command Register window and Data Register window. > BLK virtualization has high workload since each sector access will cause at > least two VM-EXIT. So we currently only imperilment vPMEM in this patchset > > --- vPMEM design --- > We introduce a new device named "nvdimm", it uses memory backend device as > NVDIMM memory. The file in file-backend device can be a regular file and block > device. We can use any file when we do test or emulation, however, > in the real word, the files passed to guest are: > - the regular file in the filesystem with DAX enabled created on NVDIMM device > on host > - the raw PMEM device on host, e,g /dev/pmem0 > Memory access on the address created by mmap on these kinds of files can > directly reach NVDIMM device on host. > > --- vConfigure data area design --- > Each NVDIMM device has a configure data area which is used to store label > namespace data. In order to emulating this area, we divide the file into two > parts: > - first parts is (0, size - 128K], which is used as PMEM > - 128K at the end of the file, which is used as Label Data Area > So that the label namespace data can be persistent during power lose or system > failure. > > We also support passing the whole file to guest without reserve any region for > label data area which is achieved by "reserve-label-data" parameter - if it's > false then QEMU will build static and readonly namespace in memory and that > namespace contains the whole file size. The parameter is false on default. > > --- _DSM method design --- > _DSM in ACPI is used to configure NVDIMM, currently we only allow access of > label namespace data, i.e, Get Namespace Label Size (Function Index 4), > Get Namespace Label Data (Function Index 5) and Set Namespace Label Data > (Function Index 6) > > _DSM uses two pages to transfer data between ACPI and Qemu, the first page > is RAM-based used to save the input info of _DSM method and Qemu reuse it > store output info and another page is MMIO-based, ACPI write data to this > page to transfer the control to Qemu > > ====== Test ====== > In host > 1) create memory backed file, e.g # dd if=zero of=/tmp/nvdimm bs=1G count=10 > 2) append "-object memory-backend-file,share,id=mem1, > mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data, > id=nv1" in QEMU command line > > In guest, download the latest upsteam kernel (4.2 merge window) and enable > ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM. > 1) insmod drivers/nvdimm/libnvdimm.ko > 2) insmod drivers/acpi/nfit.ko > 3) insmod drivers/nvdimm/nd_btt.ko > 4) insmod drivers/nvdimm/nd_pmem.ko > You can see the whole nvdimm device used as a single namespace and /dev/pmem0 > appears. You can do whatever on /dev/pmem0 including DAX access. > > Currently Linux NVDIMM driver does not support namespace operation on this > kind of PMEM, apply below changes to support dynamical namespace: > > @@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a > continue; > } > > - if (nfit_mem->bdw && nfit_mem->memdev_pmem) > + //if (nfit_mem->bdw && nfit_mem->memdev_pmem) > + if (nfit_mem->memdev_pmem) > flags |= NDD_ALIASING; > > You can append another NVDIMM device in guest and do: > # cd /sys/bus/nd/devices/ > # cd namespace1.0/ > # echo `uuidgen` > uuid > # echo `expr 1024 \* 1024 \* 128` > size > then reload nd.pmem.ko > > You can see /dev/pmem1 appears > > Xiao Guangrong (5): > nvdimm: implement NVDIMM device abstract > acpi: support specified oem table id for build_header > nvdimm acpi: build ACPI NFIT table > nvdimm acpi: build ACPI nvdimm devices > nvdimm: add maintain info > > MAINTAINERS | 7 + > default-configs/i386-softmmu.mak | 2 + > default-configs/x86_64-softmmu.mak | 2 + > hw/acpi/Makefile.objs | 1 + > hw/acpi/aml-build.c | 15 +- > hw/acpi/ich9.c | 19 ++ > hw/acpi/memory_hotplug.c | 5 + > hw/acpi/nvdimm.c | 467 +++++++++++++++++++++++++++++++++++++ > hw/acpi/piix4.c | 4 + > hw/arm/virt-acpi-build.c | 13 +- > hw/i386/acpi-build.c | 26 ++- > hw/mem/Makefile.objs | 1 + > hw/mem/nvdimm.c | 46 ++++ > include/hw/acpi/aml-build.h | 3 +- > include/hw/acpi/ich9.h | 3 + > include/hw/i386/pc.h | 12 +- > include/hw/mem/nvdimm.h | 41 ++++ > 17 files changed, 645 insertions(+), 22 deletions(-) > create mode 100644 hw/acpi/nvdimm.c > create mode 100644 hw/mem/nvdimm.c > create mode 100644 include/hw/mem/nvdimm.h Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>
Attachment:
signature.asc
Description: PGP signature