On Mon, Jan 16, 2017 at 6:00 PM, Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
> On Mon, Jan 16, 2017 at 04:58:24PM -0800, Dan Williams wrote:
>> On Mon, Jan 16, 2017 at 12:13 PM, Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>> > On Mon, Jan 16, 2017 at 11:31:39AM -0800, Dan Williams wrote:
>> [..]
>> >> >> dev_pagemap is only meant for get_user_pages() to do lookups of ptes
>> >> >> with _PAGE_DEVMAP and take a reference against the hosting device..
>> >> >
>> >> > And I want to build on top of that to extend _PAGE_DEVMAP to support
>> >> > a new use case for unaddressable device memory.
>> >> >
>> >> >>
>> >> >> Why can't HMM use the typical vm_operations_struct fault path and push
>> >> >> more of these details to a driver rather than the core?
>> >> >
>> >> > Because the vm_operations_struct has nothing to do with the device.
>> >> > We are talking about regular vmas here. Think malloc, mmap, shared
>> >> > memory, ... not about mmap(/dev/thedevice, ...).
>> >> >
>> >> > So the vm_operations_struct is never under device control and we
>> >> > can not, nor want to, rely on it.
>> >>
>> >> Can you explain more what's behind that "can not, nor want to"
>> >> statement? It seems to me that any awkwardness of moving to a
>> >> standalone device file interface is less than maintaining a new /
>> >> parallel mm fault path through dev_pagemap.
>> >
>> > The whole point of HMM is to allow transparent usage of a process
>> > address space on a device like a GPU. That implies any vma
>> > (vm_area_struct) that results from a usual mmap (i.e. any mmap, either
>> > PRIVATE or SHARED, as long as it is not an mmap of a device file).
>> >
>> > It means that an application can use malloc or the usual memory
>> > allocation primitives of the language (C++, Rust, Python, ...) and
>> > directly use the memory it gets from that with the device.
>>
>> So you need 100% support of all these mm paths for this hardware to be
>> useful at all? Does a separate device driver and a userspace helper
>> library get you something like 80% of the functionality, and then we
>> can debate the core mm changes to get the final 20%? Or am I just
>> completely off base with how people want to use this hardware?
>
> Can't do that. Think of a library that wants to use the GPU: you do not
> want to update every single program that uses that library, and the
> library gets its memory from the application. This is just one scenario.
> Then you have mmaped files, or shared memory, ...
>
> Transparent address space is where the industry is moving, and sadly on
> some platforms (like Intel) we can not rely on hardware to solve it for
> us.
>
>
>> > Devices like GPUs have a large pool of device memory that is not
>> > accessible by the CPU. This device memory has 10 times more bandwidth
>> > than system memory and has better latency than PCIe. Hence, for the
>> > whole thing to make sense, you need to allow it to be used.
>> >
>> > For that you need to allow migration from system memory to device
>> > memory. Because you can not rely on a special userspace allocator, you
>> > have to assume that the vma (vm_area_struct) is a regular one. So we
>> > are left with having struct page for the device memory to allow
>> > migration to work without requiring too many changes to the existing
>> > mm.
>> >
>> > Because device memory is not accessible by the CPU, you can not allow
>> > anyone to pin it, and thus get_user_page* must trigger a migration
>> > back just as a CPU page fault would.
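
As an aside, to make that pinning rule concrete, here is a rough
user-space sketch of the idea. Every name in it is made up for
illustration; it is not the real get_user_pages() or HMM code.

/*
 * Rough sketch only -- all names are invented for illustration; this is
 * not the kernel's get_user_pages() or HMM code.
 */
#include <stdbool.h>
#include <stdlib.h>

struct mock_page {
	bool device_unaddressable;	/* lives in device memory, no CPU mapping */
	void *cpu_addr;			/* valid only for system memory pages */
};

/* Stand-in for the driver callback that copies device data back. */
static struct mock_page *migrate_back_to_system(struct mock_page *dev_page)
{
	struct mock_page *sys_page = calloc(1, sizeof(*sys_page));

	(void)dev_page;		/* source of the copy in a real implementation */
	if (!sys_page)
		return NULL;
	sys_page->cpu_addr = malloc(4096);
	if (!sys_page->cpu_addr) {
		free(sys_page);
		return NULL;
	}
	/* ...the device would DMA the page contents here... */
	return sys_page;
}

/*
 * Mock pin of a single page: never hand out a pin on unaddressable
 * device memory; migrate first, then pin the CPU-addressable copy.
 */
static struct mock_page *pin_one_page(struct mock_page *page)
{
	if (page->device_unaddressable)
		page = migrate_back_to_system(page);
	return page;
}

The only point is that a pin request on unaddressable device memory
turns into a migration plus a pin of the resulting system page,
mirroring what a CPU fault on that address would do.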
>> >
>> >
>> >> > So what we are looking for here is a struct page that can behave
>> >> > mostly like any other, except that we do not want to allow GUP to
>> >> > take a reference -- almost exactly what ZONE_DEVICE already
>> >> > provides.
>> >> >
>> >> > So do you have any fundamental objections to this patchset? And if
>> >> > so, how do you propose I solve the problem I am trying to address?
>> >> > Because the hardware exists today, and without something like HMM
>> >> > we will not be able to support such hardware.
>> >>
>> >> My pushback stems from it being a completely different use case for
>> >> devm_memremap_pages(), as evidenced by it growing from 4 arguments to
>> >> 9, and the ongoing maintenance overhead of understanding HMM
>> >> requirements when updating the pmem usage of ZONE_DEVICE.
>> >
>> > I would rather reuse something existing and modify it to support more
>> > use cases than try to add ZONE_DEVICE2 or ZONE_DEVICE_I_AM_DIFFERENT.
>> > I have made sure that my modifications to ZONE_DEVICE can be used
>> > without HMM. It is just a generic interface to support page faults and
>> > to allow tracking the last user of a device page. Both can be used
>> > independently of each other.
>> >
>> > To me the whole point of the kernel is trying to share infrastructure
>> > across as much hardware as possible, and I am doing just that. I do
>> > not think HMM should be blocked because something that used to be for
>> > one specific use case now supports 2 use cases. I am not breaking
>> > anything existing. Is it more work for you? Maybe, but at Red Hat we
>> > intend to support it for as long as it is needed, so you always have
>> > someone to talk to if you want to update ZONE_DEVICE.
>>
>> Sharing infrastructure should not come at the expense of type safety
>> and clear usage rules.
>
> And where exactly do I violate that?

It's hard to judge without a user. For example, we found that fs/dax.c
was violating block device lifetime rules, which we solved with
dax_map_atomic(), but that couldn't have been done without seeing both
sides of the interface.

...but as I say that, I'm aware that I don't have the background in
graphics memory management that I do in the block stack to review the
usages.

>> For example the pmem case, before exposing ZONE_DEVICE memory to other
>> parts of the kernel, introduced the pfn_t type to distinguish
>> DMA-capable pfns from other raw pfns. All programmatic ways of
>> discovering whether a pmem range can support DMA use this type and
>> explicit flags.
>
> I am protected from this because I do not allow GUP. GUP triggers a
> migration back to regular system memory.
>
>>
>> While we may not need ZONE_DEVICE2, we obviously need a different
>> wrapper around arch_add_memory() than devm_memremap_pages() for HMM,
>> and likely a different physical address radix than pgmap_radix, because
>> they are servicing 2 distinct purposes. For example, I don't think HMM
>> should be using unmodified arch_add_memory(). We shouldn't add
>> unaddressable memory to the linear address mappings when we know there
>> is nothing behind it, especially when it seems all you need from
>> arch_add_memory() is for pfn_to_page() to be valid.
>
> And my patchset does just that: I do not add the device pfn to the
> linear mapping because there is nothing there. In arch_add_memory(),
> x86, ppc and arm do barely more than set up the linear mapping and add
> struct pages. So instead of splitting this function in two, I just made
> the linear mapping conditional.

Sorry, I missed that.
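
To spell out what "conditional linear mapping" means, a toy model of the
split is below. The names (mock_add_memory() and friends) are invented;
this is not the real arch_add_memory(), it only shows struct pages being
allocated unconditionally (so pfn_to_page() stays valid) while the CPU
mapping is skipped for unaddressable memory.

#include <stdbool.h>
#include <stdlib.h>

struct mock_page { unsigned long pfn; };	/* stand-in for struct page */

/* Stand-in for the vmemmap / struct page array allocation. */
static struct mock_page *mock_alloc_memmap(unsigned long nr_pages)
{
	return calloc(nr_pages, sizeof(struct mock_page));
}

/* Stand-in for the arch code that adds the range to the direct map. */
static void mock_create_linear_mapping(unsigned long start_pfn,
				       unsigned long nr_pages)
{
	(void)start_pfn;
	(void)nr_pages;
}

static int mock_add_memory(unsigned long start_pfn, unsigned long nr_pages,
			   bool cpu_addressable)
{
	/* Always allocate struct pages so pfn_to_page() works... */
	if (!mock_alloc_memmap(nr_pages))
		return -1;
	/* ...but only CPU-addressable memory gets a linear mapping. */
	if (cpu_addressable)
		mock_create_linear_mapping(start_pfn, nr_pages);
	return 0;
}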
> I can split HMM from devm_memremap_pages() and thus use a different
> pgmap_radix. You have to understand that this will not change most of
> my patchset.
>

Sure, but I think it would be worth it from a readability / maintenance
perspective. With HMM being a superset of the existing dev_pagemap
usage, it might make sense to just use struct dev_pagemap as a
sub-structure of the hmm data.
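
Something along these lines, say. The names below (mock_dev_pagemap,
mock_hmm_devmem) are hypothetical, just to sketch the embedding idiom,
not an existing interface.

#include <stddef.h>

/* Simplified container_of(); the kernel's version adds type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_dev_pagemap {
	unsigned long start_pfn;	/* range covered by this pagemap */
	unsigned long nr_pages;
};

/* HMM-private state with the shared ZONE_DEVICE piece embedded in it. */
struct mock_hmm_devmem {
	void *device;				/* driver handle, HMM-only */
	struct mock_dev_pagemap pagemap;	/* what generic code sees */
};

/* Generic ZONE_DEVICE code passes around only the dev_pagemap pointer;
 * HMM recovers its superset structure when it needs the private bits. */
static inline struct mock_hmm_devmem *
to_hmm_devmem(struct mock_dev_pagemap *pgmap)
{
	return container_of(pgmap, struct mock_hmm_devmem, pagemap);
}

That way the common code never has to know about the HMM-specific
fields, and HMM gets them back with a single container_of().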