[PATCH RFC 1/1] Add support for ZONE_DEVICE IO memory with struct pages.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Logan Gunthorpe <logang@xxxxxxxxxxxx>

Introduction
------------

This RFC patch adds struct page backing for IO memory and as such
allows IO memory to be used as a DMA target.

Patch Summary
-------------

We build on recent work done that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and to add struct page support for these new
regions of memory [2]. The patch should apply cleanly to 4.5-rc7.

1. Add a new flag (MEMREMAP_WC) to the enumerated type in include/io.h.

2. Add an extra flags argument into dev_memremap_pages to take in a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

3. For completeness, we add MEMREMAP_WC support to the memremap;
however we have no actual need for this functionality.

4. We add the static functions, add_zone_device_pages and
remove_zone_device pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch specific, but are open to other opinions.

5. dev_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate io_remap function and the zone_device pages functions
created above. A flags variable and kaddr pointer are added to struct
page_mem to facilitate this for the release function.

Motivation and Use Cases
------------------------

PCIe IO devices are getting faster. It is not uncommon now to find PCIe
network and storage devices that can generate and consume several GB/s.
Almost always these devices have either a high performance DMA engine, a
number of exposed PCIe BARs or both.

Until this patch, any high-performance transfer of information between
two PICe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not compromised
when high-throughput transfers occurs between PCIe devices. This means
that more system memory bandwidth is available to the CPU cores for data
processing and manipulation. In addition, in systems where the two PCIe
devices reside behind a PCIe switch the datapath avoids the CPU entirely.

Consumers
---------

In order to test this patch and to give an example of a consumer of this
patch we have developed a PCIe driver which can be used to bind to any
PCIe device that exposes IO memory via a PCIe BAR. A basic out of tree
module can be found at [3] which is intended only for testing and example
purposes. Assuming this RFC is received well, the first in-kernel user
would be a patchset which facilitates the Controller Memory Buffer (CMB)
feature for NVMe. This would enable RDMA devices to move data directly
to/from files on NVMe devices without the need for a round trip through
system memory.

The example out of tree module allows for the following accesses:

1. Block based access. For this we borrowed heavily from the pmem device
driver. Note that this mode allows for DAX filesystems to be mounted
on the block device. The driver uses devm_memremap_pages on a block of
PCIe memory such that files on this FS can then be mmapped using direct
access and used as a DMA target.

2. MMAP based access. The driver again uses devm_memremap_pages, then
hands out ZONE_DEVICE backed pages with its fault function. Thus, with
this patch, these mappings would now be usable as DMA targets.

Testing and Performance
-----------------------

We have done a moderate about of testing of this patch on a QEMU
environment and a real hardware environment. On real hardware we have
observed peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s.
In both cases these numbers are limitations of our consumer hardware. In
addtion, we have observed that the CPU DRAM bandwidth is not impacted
when using IOPMEM which is not the case when a traditional patch through
system memory is taken.

For more information on the testing and performance results see the
GitHub site [4].

Known Issues
------------

1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-2-peer transfer will not correctly route to the IO memory
intended. However in our testing to date we have not seen this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch).

2. Memory Segment Spacing. This patch has the same limitations that
ZONE_DEVICE does in that memory regions must be spaces at least
SECTION_SIZE bytes part. On x86 this is 128MB and there are cases where
BARs can be placed closer together than this. Thus ZONE_DEVICE would not
be usable on neighboring BARs. For our purposes, this is not an issue as
we'd only be looking at enabling a single BAR in a given PCIe device.
More exotic use cases may have problems with this.

3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur out
of order. This is something that users of this feature need to be
cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
this isn't much different than the existing situation with RDMA: if
userspace sets up an MR for remote use, they need to be careful about
using that memory region themselves.

4. Architecture. Currently this patch is applicable only to x86
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
ARCH over time.

References
----------

[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html
[3] https://github.com/sbates130272/linux-donard/tree/iopmem-rfc
[4] https://github.com/sbates130272/zone-device

Signed-off-by: Stephen Bates <stephen.bates@xxxxxxxx>
Signed-off-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>
---
 drivers/nvdimm/pmem.c    |  4 +--
 include/linux/io.h       |  1 +
 include/linux/memremap.h |  5 ++--
 kernel/memremap.c        | 78 ++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8d0b546..095f358 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -184,7 +184,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
-				&q->q_usage_counter, NULL);
+				&q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
@@ -410,7 +410,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	q = pmem->pmem_queue;
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
-			&q->q_usage_counter, altmap);
+			&q->q_usage_counter, altmap, ARCH_MEMREMAP_PMEM);
 	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
diff --git a/include/linux/io.h b/include/linux/io.h
index 32403b5..e2c8419 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -135,6 +135,7 @@ enum {
 	/* See memremap() kernel-doc for usage description... */
 	MEMREMAP_WB = 1 << 0,
 	MEMREMAP_WT = 1 << 1,
+	MEMREMAP_WC = 1 << 2,
 };

 void *memremap(resource_size_t offset, size_t size, unsigned long flags);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index bcaa634..ad5cf23 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {

 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, unsigned long flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b981a7b..a739286 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -41,7 +41,7 @@ static void *try_ram_remap(resource_size_t offset, size_t size)
  * memremap() - remap an iomem_resource as cacheable memory
  * @offset: iomem resource start address
  * @size: size of remap
- * @flags: either MEMREMAP_WB or MEMREMAP_WT
+ * @flags: either MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC
  *
  * memremap() is "ioremap" for cases where it is known that the resource
  * being mapped does not have i/o side effects and the __iomem
@@ -57,6 +57,9 @@ static void *try_ram_remap(resource_size_t offset, size_t size)
  * cache or are written through to memory and never exist in a
  * cache-dirty state with respect to program visibility.  Attempts to
  * map "System RAM" with this mapping type will fail.
+ *
+ * MEMREMAP_WC - like MEMREMAP_WT but enables write combining.
+ *
  */
 void *memremap(resource_size_t offset, size_t size, unsigned long flags)
 {
@@ -101,6 +104,12 @@ void *memremap(resource_size_t offset, size_t size, unsigned long flags)
 		addr = ioremap_wt(offset, size);
 	}

+
+	if (!addr && (flags & MEMREMAP_WC)) {
+		flags &= ~MEMREMAP_WC;
+		addr = ioremap_wc(offset, size);
+	}
+
 	return addr;
 }
 EXPORT_SYMBOL(memremap);
@@ -164,13 +173,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+	PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
 	struct resource res;
 	struct percpu_ref *ref;
 	struct dev_pagemap pgmap;
 	struct vmem_altmap altmap;
+	void *kaddr;
+	int flags;
 };

+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	struct zone *zone = pgdat->node_zones + ZONE_DEVICE;
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	return __add_pages(nid, zone, start_pfn, nr_pages);
+}
+
+static void remove_zone_device_pages(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+}
+
 void get_zone_device_page(struct page *page)
 {
 	percpu_ref_get(page->pgmap->ref);
@@ -235,8 +272,15 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+
+	if (page_map->flags & PAGEMAP_IO_MEM) {
+		remove_zone_device_pages(align_start, align_size);
+		iounmap(page_map->kaddr);
+	} else {
+		arch_remove_memory(align_start, align_size);
+	}
 	pgmap_radix_release(res);
+
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
 			"%s: failed to free all reserved pages\n", __func__);
 }
@@ -258,6 +302,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  * @res: "host memory" address range
  * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
+ * @flags: either MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC
+ *         see memremap() for a description of the flags
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -268,7 +314,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *    this is not enforced.
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap)
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
@@ -277,6 +324,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	struct page_map *page_map;
 	unsigned long pfn;
 	int error, nid;
+	void *addr = NULL;

 	if (is_ram == REGION_MIXED) {
 		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
@@ -344,10 +392,28 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (nid < 0)
 		nid = numa_mem_id();

-	error = arch_add_memory(nid, align_start, align_size, true);
+	if (flags & MEMREMAP_WB || !flags) {
+		error = arch_add_memory(nid, align_start, align_size, true);
+		addr = __va(res->start);
+	} else {
+		page_map->flags |= PAGEMAP_IO_MEM;
+		error = add_zone_device_pages(nid, align_start, align_size);
+	}
+
 	if (error)
 		goto err_add_memory;

+	if (!addr && (flags & MEMREMAP_WT))
+		addr = ioremap_wt(res->start, resource_size(res));
+
+	if (!addr && (flags & MEMREMAP_WC))
+		addr = ioremap_wc(res->start, resource_size(res));
+
+	if (!addr && page_map->flags & PAGEMAP_IO_MEM) {
+		remove_zone_device_pages(res->start, resource_size(res));
+		goto err_add_memory;
+	}
+
 	for_each_device_pfn(pfn, page_map) {
 		struct page *page = pfn_to_page(pfn);

@@ -355,8 +421,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		list_force_poison(&page->lru);
 		page->pgmap = pgmap;
 	}
+
+	page_map->kaddr = addr;
 	devres_add(dev, page_map);
-	return __va(res->start);
+	return addr;

  err_add_memory:
  err_radix:
--
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]