On 2021-09-02 at 4:18 a.m., Christoph Hellwig wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then. Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables. What about the kernel direct map?
>>> If there is a normal kernel struct page backing it, there really
>>> should be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system
>> address map as special-purpose memory. Does Linux create the struct
>> pages and kernel direct map for that without a pgmap call? I didn't
>> see that the last time I went digging through that code.
> Some googling turns up a patch from Dan that claims to hand EFI
> special-purpose memory to the device-dax driver. But when I try to
> follow the version that got merged, it looks like it is treated simply
> as an MMIO region to be claimed by drivers, which would not get a
> struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?
>
>>>> From an application perspective, we want file-backed and anonymous
>>>> mappings to be able to use DEVICE_PUBLIC pages with coherent CPU
>>>> access. The goal is to optimize performance for GPU-heavy workloads
>>>> while minimizing the need to migrate data back and forth between
>>>> system memory and device memory.
>>> I don't really understand that part. File-backed pages are always
>>> allocated by the file system using the pagecache helpers, that is,
>>> using the page allocator. Anonymous memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migratable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
> OK, so you want to migrate to and from them, not use DEVICE_PUBLIC
> for the actual page cache pages. That makes a lot more sense.
>
>> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that
>> allows the CPU to map the device memory coherently, to minimize the
>> need for migrations when the CPU and GPU access the same memory
>> concurrently or alternately. But we're not going as far as putting
>> that memory entirely under the management of the Linux memory manager
>> and VM subsystem. Our (and HPE's) system architects decided that this
>> memory is not suitable to be used like regular NUMA system memory by
>> the Linux memory manager.
> So yes, it is a memory-mapped I/O region, which, unlike the PCIe BARs
> that people typically deal with, is fully cache-coherent. I think this
> does make more sense as a description.
>
> But to go back to what started this discussion: if these are
> memory-mapped I/O, pfn_valid should generally not return true for
> them.

As I understand it, pfn_valid should be true for any pfn that's part of
the kernel's physical memory map, i.e. anything that is returned by
page_to_pfn or works with pfn_to_page. Both the hmm_range_fault and the
migrate_vma_* APIs use pfns to refer to regular system memory and
ZONE_DEVICE pages (even DEVICE_PRIVATE). Therefore I believe pfn_valid
should be true for ZONE_DEVICE pages as well.
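
To make that concrete, a purely illustrative sketch of the round trip
I'm relying on. pfn_round_trips is a made-up name, not an existing
kernel function; page_to_pfn, pfn_valid and pfn_to_page are the real
interfaces:

#include <linux/mm.h>

/*
 * Illustrative only: hmm_range_fault() and the migrate_vma_* helpers
 * hand out pfns and convert them back with pfn_to_page(), so every
 * page in the kernel's memory map -- including ZONE_DEVICE pages
 * created by memremap_pages() -- has to satisfy this round trip.
 */
static bool pfn_round_trips(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);

	return pfn_valid(pfn) && pfn_to_page(pfn) == page;
}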
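
And for reference, the migrate_vma_* flow mentioned above, sketched
against the current (5.14-era) API. migrate_range_to_device() and
my_alloc_and_lock_device_page() are made-up, driver-side names;
migrate_vma_setup/pages/finalize, migrate_pfn() and the MIGRATE_* flags
are the real interface. The data copy and most error handling are
elided:

#include <linux/migrate.h>
#include <linux/mm.h>

static int migrate_range_to_device(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	struct migrate_vma args = {};
	unsigned long *src, *dst;
	unsigned long i;
	int ret = -ENOMEM;

	src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst)
		goto out;

	args.vma   = vma;
	args.start = start;
	args.end   = end;
	args.src   = src;
	args.dst   = dst;
	/* Select ordinary system pages (anonymous or file-backed). */
	args.flags = MIGRATE_VMA_SELECT_SYSTEM;

	ret = migrate_vma_setup(&args);	/* isolates and unmaps the sources */
	if (ret)
		goto out;

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
			continue;
		/* Driver-specific allocator; returns a locked device page. */
		dpage = my_alloc_and_lock_device_page();
		if (!dpage)
			continue;	/* this pfn simply stays where it is */
		args.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
			      MIGRATE_PFN_LOCKED;
	}

	/* ... copy source to destination pages, e.g. with the GPU's DMA engine ... */

	migrate_vma_pages(&args);	/* installs the new page-table entries */
	migrate_vma_finalize(&args);	/* restores pages that didn't migrate */
out:
	kvfree(src);
	kvfree(dst);
	return ret;
}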
Regards,
  Felix

>
> And as you already pointed out in reply to Alex, we need to tighten
> the selection criteria one way or another.