Felix Kuehling wrote: > Am 2022-10-20 um 17:56 schrieb Dan Williams: > > A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE > > pages. The pgmap is a reference counted object that serves a similar > > role as a 'struct request_queue'. Live references are obtained for each > > in flight request / page, and once a page's reference count drops to > > zero the associated pin of the pgmap is dropped as well. While a page is > > idle nothing should be accessing it because that is effectively a > > use-after-free situation. Unfortunately, all current ZONE_DEVICE > > implementations deploy a layering violation to manage requests to > > activate pages owned by a pgmap. Specifically, they take steps like walk > > the pfns that were previously assigned at memremap_pages() time and use > > pfn_to_page() to recall metadata like page->pgmap, or make use of other > > data like page->zone_device_data. > > > > The first step towards correcting that situation is to provide a > > API to get access to a pgmap page that does not require the caller to > > know the pfn, nor access any fields of an idle page. Ideally this API > > would be able to support dynamic page creation instead of the current > > status quo of pre-allocating and initializing pages. > > If I understand it correctly, the current code works because the struct > pages are pre-allocated. Is the end-goal here to make the struct page > allocation for ZONE_DEVICE pages dynamic. Some DEVICE_PRIVATE users have already open-coded their own coarse grained dynamic ZONE_DEVICE pages by waiting to allocate chunks on demand. The users that would benefit from a general dynamic ZONE_DEVICE facility are cases like VMs backed by device-dax instances. Unless the VM calls for bare metal services there is no need to map pages for the device-dax instance in the hypervisor. So, the end goal here is to just add some sanity to ZONE_DEVICE page reference counting to allow for requiring an arbitration for page access rather than just pfn_to_page() and assuming the page is already there. Dynamic ZONE_DEVICE becomes something that is possible once that sanity is in place. > > On a prompt from Jason, introduce pgmap_request_folio() that operates on > > an offset into a pgmap. > > This looks like it would make it fairly easy to request larger (higher > order) folios for physically contiguous device memory allocations in the > future. > > > > It replaces the shortlived > > pgmap_request_folios() that was continuing the layering violation of > > assuming pages are available to be consulted before asking the pgmap to > > make them available. > > > > For now this only converts the callers to lookup the pgmap and generate > > the pgmap offset, but it does not do the deeper cleanup of teaching > > those call sites to generate those arguments without walking the page > > metadata. For next steps it appears the DEVICE_PRIVATE implementations > > could plumb the pgmap into the necessary callsites and switch to using > > gen_pool_alloc() to track which offsets of a pgmap are allocated. > > Wouldn't that duplicate whatever device memory allocator we already have > in our driver? Couldn't I just take the memory allocation from our TTM > allocator and make necessary pgmap_request_folio calls to allocate the > corresponding pages from the pgmap? I think you could, as long as the output from that allocator is a pgmap_offset rather than a pfn. > Or does the pgmap allocation need a finer granularity than the device > memory allocation? I would say the pgmap *allocation* happens at memremap_pages() time. pgmap_request_folio() is a request to put a pgmap page into service. So, yes, I think you can bring your own allocator for what offsets are in/out of service in pgmap space.