On 08/23/2017 11:39 AM, Dan Williams wrote:
> On Mon, Aug 21, 2017 at 2:11 PM, Dave Jiang <dave.jiang@xxxxxxxxx> wrote:
>> Adding a DMA supported blk-mq driver for pmem.
>
> "Add support for offloading pmem block-device I/O operations to a DMA engine."
>
>> This provides signficant CPU
>
> *significant
>
>> utilization reduction.
>
> "at the cost of some increased latency and bandwidth reduction in some cases."
>
>> By default the pmem driver will be using blk-mq with
>
> "By default the current cpu-copy based pmem driver will load, but this
> driver can be manually selected with a modprobe configuration."
>
>> DMA through the dmaengine API. DMA can be turned off with use_dma=0 kernel
>> parameter.
>
> Do we need the module option? It seems for debug / testing a user can
> simply unload the ioatdma driver, otherwise we should use dma by
> default.
>
>> Additional kernel parameters are provided:
>>
>> queue_depth: The queue depth for blk-mq. Typically in relation to what the
>>              DMA engine can provide per queue/channel. This needs to take
>>              into account of num_sg as well for some DMA engines. i.e.
>>              num_sg * queue_depth < total descriptors available per queue or
>>              channel.
>>
>> q_per_node: Hardware queues per node. Typically the number of channels the
>>             DMA engine can provide per socket.
>> num_sg: Number of scatterlist we can handle per I/O request.
> Why do these need to be configurable?

The concern is with other arch/platforms that have different DMA
engines. The configurations would be platform dependent.

>
>>
>> Numbers below are measured against pmem simulated via DRAM using
>> memmap=NN!SS. DMA engine used is the ioatdma on Intel Skylake Xeon
>> platform. Keep in mind the performance for persistent memory
>> will differ.
>> Fio 2.21 was used.
>>
>> 64k: 1 task queuedepth=1
>> CPU Read:  7631 MB/s   99.7% CPU    DMA Read:  2415 MB/s    54% CPU
>> CPU Write: 3552 MB/s    100% CPU    DMA Write  2173 MB/s    54% CPU
>>
>> 64k: 16 tasks queuedepth=16
>> CPU Read:  36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> CPU Write  20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>
>> 2M: 1 task queuedepth=1
>> CPU Read:  6013 MB/s   99.3% CPU    DMA Read:  7986 MB/s   59.3% CPU
>> CPU Write: 3579 MB/s    100% CPU    DMA Write: 5211 MB/s   58.3% CPU
>>
>> 2M: 16 tasks queuedepth=16
>> CPU Read:  18100 MB/s  1588% CPU    DMA Read:  21300 MB/s  180.9% CPU
>> CPU Write: 14100 MB/s  1594% CPU    DMA Write: 20400 MB/s  446.9% CPU
>>
>> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
>> Reviewed-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
>>
>> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
>> ---
>>  drivers/nvdimm/Kconfig   |   23 +
>>  drivers/nvdimm/Makefile  |    3
>>  drivers/nvdimm/pmem.h    |    3
>>  drivers/nvdimm/pmem_mq.c |  853 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 882 insertions(+)
>>  create mode 100644 drivers/nvdimm/pmem_mq.c
>>
>> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
>> index 5bdd499..c88c2bb 100644
>> --- a/drivers/nvdimm/Kconfig
>> +++ b/drivers/nvdimm/Kconfig
>> @@ -36,6 +36,29 @@ config BLK_DEV_PMEM
>>
>>           Say Y if you want to use an NVDIMM
>>
>> +config BLK_DEV_PMEM_MQ
>
> Lets call the driver pmem_dma instead of pmem_mq, the "mq" is just a
> side effect of enabling dma support.
>
>> +       tristate "PMEM: Persistent memory block device multi-queue support"
>> +       depends on m
>
> Not sure what the exact kconfig syntax should be, but the policy I'm
> looking for is that this driver should be allowed to be built-in if
> the pmem driver is disabled.
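
Would an expression along these lines capture that policy? This is only an
untested sketch on my side, and the BLK_DEV_PMEM_DMA name assumes the
pmem_dma rename you suggest above:

config BLK_DEV_PMEM_DMA
        tristate "PMEM: Persistent memory block device DMA support"
        # "BLK_DEV_PMEM=n || m" should evaluate to y only when the
        # cpu-copy pmem driver is disabled; with BLK_DEV_PMEM=y or =m it
        # evaluates to m, restricting this driver to module-only.
        depends on BLK_DEV_PMEM=n || m

If I'm reading kconfig-language.txt correctly that allows built-in only in
the pmem-disabled case, but I still need to verify it against the config
tooling.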
>
>> +       default LIBNVDIMM
>> +       select DAX
>> +       select ND_BTT if BTT
>> +       select ND_PFN if NVDIMM_PFN
>
> I think these should probably move to a common symbol used by both
> pmem and pmem_dma.
>
>> +       select DMA_ENGINE
>
> Shouldn't this be "depends on"?
>
>> +       help
>> +         Memory ranges for PMEM are described by either an NFIT
>> +         (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
>> +         non-standard OEM-specific E820 memory type (type-12, see
>> +         CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
>> +         'memmap=nn[KMG]!ss[KMG]' kernel command line (see
>> +         Documentation/admin-guide/kernel-parameters.rst). This driver
>> +         converts these persistent memory ranges into block devices that are
>> +         capable of DAX (direct-access) file system mappings. See
>> +         Documentation/nvdimm/nvdimm.txt for more details. This driver
>> +         utilizes block layer multi-queue in order to support using DMA
>> +         engines to help offload the data copying.
>
> Rather than duplicating some of the pmem text I think this help text
> should only talk about the incremental differences relative to the
> base pmem driver.
>
>> +
>> +         Say Y if you want to use an NVDIMM
>> +
>>  config ND_BLK
>>         tristate "BLK: Block data window (aperture) device support"
>>         default LIBNVDIMM
>> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
>> index 909554c..8bfd107 100644
>> --- a/drivers/nvdimm/Makefile
>> +++ b/drivers/nvdimm/Makefile
>> @@ -1,11 +1,14 @@
>>  obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
>>  obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
>> +obj-$(CONFIG_BLK_DEV_PMEM_MQ) += nd_pmem_mq.o
>>  obj-$(CONFIG_ND_BTT) += nd_btt.o
>>  obj-$(CONFIG_ND_BLK) += nd_blk.o
>>  obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
>>
>>  nd_pmem-y := pmem.o
>>
>> +nd_pmem_mq-y := pmem_mq.o
>> +
>
> s/mq/dma/
>
> I also think it's okay to drop the nd_ prefix for this.
>
>>  nd_btt-y := btt.o
>>
>>  nd_blk-y := blk.o
>> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
>> index 5434321..bbbe982 100644
>> --- a/drivers/nvdimm/pmem.h
>> +++ b/drivers/nvdimm/pmem.h
>> @@ -4,6 +4,7 @@
>>  #include <linux/types.h>
>>  #include <linux/pfn_t.h>
>>  #include <linux/fs.h>
>> +#include <linux/blk-mq.h>
>>
>>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>>  #define ARCH_MEMREMAP_PMEM MEMREMAP_WB
>> @@ -35,6 +36,8 @@ struct pmem_device {
>>         struct badblocks bb;
>>         struct dax_device *dax_dev;
>>         struct gendisk *disk;
>> +       struct blk_mq_tag_set tag_set;
>> +       struct request_queue *q;
>>  };
>>
>>  long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
>> diff --git a/drivers/nvdimm/pmem_mq.c b/drivers/nvdimm/pmem_mq.c
>> new file mode 100644
>> index 0000000..db2fc2a
>> --- /dev/null
>> +++ b/drivers/nvdimm/pmem_mq.c
>
> [snip bulk of driver code]
>
> Rather than copy-paste most of the existing pmem driver let's refactor
> all the common bits into a pmem-core module.
>
>> +MODULE_AUTHOR("Dave Jiang <dave.jiang@xxxxxxxxx>");
>
> Drop this MODULE_AUTHOR(), I think this is better handled with a
> MAINTAINERS entry.
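
For the common symbol and the pmem-core split, would something roughly like
this work on the Kconfig side? Just a sketch so I can check I'm following
the suggestion; BLK_DEV_PMEM_CORE is a placeholder name here, and the shared
probe/bio code would move into the pmem-core module it represents:

# Hidden symbol carrying the dependencies both drivers need.
config BLK_DEV_PMEM_CORE
        tristate
        select DAX
        select ND_BTT if BTT
        select ND_PFN if NVDIMM_PFN

config BLK_DEV_PMEM
        tristate "PMEM: Persistent memory block device support"
        default LIBNVDIMM
        select BLK_DEV_PMEM_CORE

config BLK_DEV_PMEM_DMA
        tristate "PMEM: Persistent memory block device DMA support"
        # depends on rather than select, per the comment above
        depends on DMA_ENGINE
        # allow built-in only when the cpu-copy pmem driver is disabled
        depends on BLK_DEV_PMEM=n || m
        select BLK_DEV_PMEM_CORE

Does that match what you had in mind?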