====== Background ======
NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be supported
on Intel platforms. NVDIMM devices are discovered via ACPI and configured
through the _DSM method of the NVDIMM device in ACPI. Supporting documents
can be found at:
ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

The NVDIMM driver has already been merged into the upstream Linux kernel,
and this patchset tries to enable NVDIMM in virtualization.

====== Design ======
NVDIMM supports two access modes. One is PMEM, which maps the NVDIMM into
the CPU's address space so that the CPU can access it directly as normal
memory. The other is BLK, which exposes the NVDIMM as a block device to
reduce the amount of CPU address space it occupies.

BLK mode accesses the NVDIMM via a Command Register window and a Data
Register window. Virtualizing BLK has high overhead since each sector
access causes at least two VM-EXITs, so this patchset currently implements
only vPMEM.

--- vPMEM design ---
We introduce a new device named "pc-nvdimm". It has one parameter, file,
which is the file-backed memory passed to the guest. The file can be a
regular file or a block device. Any file can be used for testing or
emulation; in the real world, however, the files passed to the guest are:
- a regular file in a DAX-enabled filesystem created on an NVDIMM device
  on the host
- the raw PMEM device on the host, e.g. /dev/pmem0
Memory accesses to addresses created by mmap on these kinds of files reach
the NVDIMM device on the host directly.

--- vConfigure data area design ---
Each NVDIMM device has a configure data area which is used to store label
namespace data.
To emulate this area, we divide the file into two parts:
- the first part is [0, size - 128K), which is used as PMEM
- the last 128K of the file, which is used as the Config Data Area
This way the label namespace data persists across power loss or system
failure.

--- _DSM method design ---
_DSM in ACPI is used to configure NVDIMM. Currently we only allow access
to label namespace data, i.e. Get Namespace Label Size (Function Index 4),
Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
(Function Index 6).

_DSM uses two pages to transfer data between ACPI and QEMU. The first page
is RAM-based: it holds the input of the _DSM method, and QEMU reuses it to
store the output. The other page is MMIO-based: ACPI writes to this page
to transfer control to QEMU.

We map these pages in the address region above 4G because there is a huge
amount of free space there, which avoids overlapping with PCI and other
components that reserve addresses (e.g. HPET). This is also why we chose
MMIO notification instead of PIO.

====== Test ======
In host
1) create a memory-backed file, e.g.
   # dd if=/dev/zero of=/tmp/nvdimm bs=1G count=10
2) append '-device pc-nvdimm,file=/tmp/nvdimm' to the QEMU command line

In guest, download the latest upstream kernel (4.2 merge window) and enable
ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
1) insmod drivers/nvdimm/libnvdimm.ko
2) insmod drivers/acpi/nfit.ko
3) insmod drivers/nvdimm/nd_btt.ko
4) insmod drivers/nvdimm/nd_pmem.ko
The whole nvdimm device is used as a single namespace and /dev/pmem0
appears. You can do anything with /dev/pmem0, including DAX access.
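
As a rough cross-check on the sizes involved in this test, the backing-file
split from the vConfigure data area design (everything except the last 128K
is PMEM, the last 128K is the Config Data Area) can be sketched in Python.
This is only an illustration under stated assumptions: split_backing_file
and CONFIG_AREA_SIZE are hypothetical names, not identifiers from the
patchset, and the size the guest actually reports may differ if the
implementation aligns or rounds the regions.

```python
# Sketch only: the names below are hypothetical, not taken from the patchset.
CONFIG_AREA_SIZE = 128 * 1024  # 128K Config Data Area at the end of the file


def split_backing_file(file_size):
    """Return (pmem, cfg) as half-open byte ranges (start, end) in the file."""
    if file_size <= CONFIG_AREA_SIZE:
        raise ValueError("backing file must be larger than the 128K config area")
    pmem_end = file_size - CONFIG_AREA_SIZE
    # PMEM occupies the front of the file, the config area the last 128K.
    return (0, pmem_end), (pmem_end, file_size)


# The 10G file created by the dd command in the host steps (bs=1G count=10).
pmem, cfg = split_backing_file(10 * 1024 ** 3)
print("PMEM:        [%#x, %#x)" % pmem)
print("Config Data: [%#x, %#x)" % cfg)
```

Half-open ranges are used here so that the two parts tile the file exactly,
with no byte counted twice.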

Currently the Linux NVDIMM driver does not support namespace operations on
this kind of PMEM. Apply the change below to support dynamic namespaces:

@@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
                        continue;
                }

-               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+               if (nfit_mem->memdev_pmem)
                        flags |= NDD_ALIASING;

You can then append another NVDIMM device to the guest and do:
# cd /sys/bus/nd/devices/
# cd namespace1.0/
# echo `uuidgen` > uuid
# echo `expr 1024 \* 1024 \* 128` > size
then reload nd_pmem.ko.

/dev/pmem1 then appears.

====== TODO ======
1) NVDIMM NUMA support
2) NVDIMM hotplug support

Xiao Guangrong (16):
  acpi: allow aml_operation_region() working on 64 bit offset
  i386/acpi-build: allow SSDT to operate on 64 bit
  acpi: add aml_derefof
  acpi: add aml_sizeof
  acpi: add aml_create_field
  pc: implement NVDIMM device abstract
  nvdimm: reserve address range for NVDIMM
  nvdimm: init backend memory mapping and config data area
  nvdimm: build ACPI NFIT table
  nvdimm: init the address region used by _DSM method
  nvdimm: build ACPI nvdimm devices
  nvdimm: save arg3 for NVDIMM device _DSM method
  nvdimm: support NFIT_CMD_IMPLEMENTED function
  nvdimm: support NFIT_CMD_GET_CONFIG_SIZE function
  nvdimm: support NFIT_CMD_GET_CONFIG_DATA
  nvdimm: support NFIT_CMD_SET_CONFIG_DATA

 hw/acpi/aml-build.c         |   32 +-
 hw/i386/acpi-build.c        |    9 +-
 hw/i386/acpi-dsdt.dsl       |    2 +-
 hw/i386/pc.c                |   11 +-
 hw/mem/Makefile.objs        |    1 +
 hw/mem/pc-nvdimm.c          | 1040 +++++++++++++++++++++++++++++++++++++++++++
 include/hw/acpi/aml-build.h |    5 +-
 include/hw/mem/pc-nvdimm.h  |   56 +++
 8 files changed, 1149 insertions(+), 7 deletions(-)
 create mode 100644 hw/mem/pc-nvdimm.c
 create mode 100644 include/hw/mem/pc-nvdimm.h

--
2.1.0