On Wed, Feb 17, 2016 at 7:25 AM, Haggai Eran <haggaie@xxxxxxxxxxxx> wrote:
> On 17/02/2016 10:44, Christoph Hellwig wrote:
>> That doesn't change how they are managed. We've always supported mapping
>> BARs to userspace in various drivers, and the only real news with things
>> like the pmem driver with DAX or some of the things people want to do
>> with the NVMe controller memory buffer is that there are much bigger
>> quantities of it, and:
>>
>> a) people want to be able to have cacheable mappings of various kinds
>>    instead of the old uncacheable default.
> What if we do want an uncacheable mapping for our device's BAR? Can we still
> expose it under ZONE_DEVICE?
>
>> b) we want to be able to DMA (including RDMA) to the regions in the
>>    BARs.
>>
>> a) is something that needs smaller amounts in all kinds of areas to be
>> done properly, but in principle GPU drivers have been doing this forever
>> using all kinds of hacks.
>>
>> b) is the real issue. The Linux DMA support code doesn't really operate
>> on just physical addresses, but on page structures, and we don't
>> allocate them for BARs. We investigated two ways to address this: 1) allow
>> DMA operations without struct page and 2) create struct page structures
>> for BARs that we want to be able to use DMA operations on. For various
>> reasons version 2) was favored and this is how we ended up with
>> ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy
>> discussions of how we ended up here.
>
> I was wondering what are your thoughts regarding the other questions we raised
> about ZONE_DEVICE.
>
> How can we overcome the section-alignment requirement in the current code? Our
> HCA's BARs are usually smaller than 128MB.

This may not help, but note that the section-alignment only bites when
trying to have 2 mappings with different lifetimes in a single
section. It's otherwise fine to map a full section for a smaller
single range; you'll just end up with pages that won't be used.
However, this assumes that you are fine with everything in that
section being mapped cacheable; you couldn't mix uncacheable mappings
in that same range.

> Sagi also asked how should a peer device who got a ZONE_DEVICE page know it
> should stop using it (the CMB example).

ZONE_DEVICE pages come with a per-cpu reference counter via
page->pgmap. See get_dev_pagemap(), get_zone_device_page(), and
put_zone_device_page().

However, this gets confusing quickly when a 'pfn' and a 'page' start
referencing mmio space instead of host memory. It seems like we need
new data types because a dma_addr_t does not necessarily reflect the
peer-to-peer address as seen by the device.
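
To make the section-alignment point above concrete, here is a rough
user-space sketch of the arithmetic, not kernel code. It rounds a device
range outward to whole sections and counts the pages that end up mapped
but unused. The 4 KiB page size and 128 MiB section size are assumptions
(the latter matches the 128MB figure in the question), and the names
span_for_range() and ranges_share_section() are hypothetical helpers
invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the thread: 4 KiB pages, 128 MiB sections. */
#define PAGE_SHIFT     12
#define SECTION_SHIFT  27
#define SECTION_SIZE   (UINT64_C(1) << SECTION_SHIFT)
#define SECTION_MASK   (~(SECTION_SIZE - 1))

struct section_span {
    uint64_t start;        /* section-aligned start address */
    uint64_t end;          /* section-aligned end address (exclusive) */
    uint64_t used_pages;   /* pages inside the device range itself */
    uint64_t unused_pages; /* pages covered only because of the rounding */
};

/*
 * Round [base, base + len) outward to whole sections.
 * base and len are assumed to be page aligned.
 */
struct section_span span_for_range(uint64_t base, uint64_t len)
{
    struct section_span s;

    assert(len > 0);
    s.start = base & SECTION_MASK;
    s.end = (base + len + SECTION_SIZE - 1) & SECTION_MASK;
    s.used_pages = len >> PAGE_SHIFT;
    s.unused_pages = ((s.end - s.start) >> PAGE_SHIFT) - s.used_pages;
    return s;
}

/*
 * Return 1 if the two ranges land in at least one common section.
 * That is only a problem when the two mappings have different lifetimes.
 */
int ranges_share_section(uint64_t base_a, uint64_t len_a,
                         uint64_t base_b, uint64_t len_b)
{
    struct section_span a = span_for_range(base_a, len_a);
    struct section_span b = span_for_range(base_b, len_b);

    return a.start < b.end && b.start < a.end;
}
```

For a made-up 32 MiB BAR at 0xf0000000, span_for_range() covers
0xf0000000-0xf8000000, with 8192 pages in use and 24576 pages left
unused. A second made-up range at 0xf4000000 would share that section,
while one starting at 0xf8000000 would not.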