On Thu, Oct 20, 2022 at 02:56:39PM -0700, Dan Williams wrote:
> A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> pages. The pgmap is a reference counted object that serves a similar
> role as a 'struct request_queue'. Live references are obtained for each
> in flight request / page, and once a page's reference count drops to
> zero the associated pin of the pgmap is dropped as well. While a page is
> idle nothing should be accessing it because that is effectively a
> use-after-free situation. Unfortunately, all current ZONE_DEVICE
> implementations deploy a layering violation to manage requests to
> activate pages owned by a pgmap. Specifically, they take steps like walk
> the pfns that were previously assigned at memremap_pages() time and use
> pfn_to_page() to recall metadata like page->pgmap, or make use of other
> data like page->zone_device_data.
>
> The first step towards correcting that situation is to provide an
> API to get access to a pgmap page that does not require the caller to
> know the pfn, nor access any fields of an idle page. Ideally this API
> would be able to support dynamic page creation instead of the current
> status quo of pre-allocating and initializing pages.
>
> On a prompt from Jason, introduce pgmap_request_folio() that operates on
> an offset into a pgmap. It replaces the short-lived
> pgmap_request_folios() that was continuing the layering violation of
> assuming pages are available to be consulted before asking the pgmap to
> make them available.
>
> For now this only converts the callers to look up the pgmap and generate
> the pgmap offset, but it does not do the deeper cleanup of teaching
> those call sites to generate those arguments without walking the page
> metadata. For next steps it appears the DEVICE_PRIVATE implementations
> could plumb the pgmap into the necessary callsites and switch to using
> gen_pool_alloc() to track which offsets of a pgmap are allocated. For
> DAX, dax_direct_access() could switch from returning pfns to returning
> the associated @pgmap and @pgmap_offset. Those changes are saved for
> follow-on work.

I like it, though it would be nice to see drivers converted away from
pfn_to_pgmap_offset()..

> /**
> - * pgmap_request_folios - activate an contiguous span of folios in @pgmap
> - * @pgmap: host page map for the folio array
> - * @folio: start of the folio list, all subsequent folios have same folio_size()
> + * pgmap_request_folio - activate a folio of a given order in @pgmap
> + * @pgmap: host page map of the folio to activate
> + * @pgmap_offset: page-offset into the pgmap to request
> + * @order: expected folio_order() of the folio
>  *
>  * Caller is responsible for @pgmap remaining live for the duration of
> - * this call. Caller is also responsible for not racing requests for the
> - * same folios.
> + * this call. The order (size) of the folios in the pgmap are assumed
> + * stable before this call.
>  */

I would probably add some discussion here that this enables refcounting
on the folio, and that pgmap_ops->page_free() will be called once the
folio is no longer being used. And explain that the pgmap user is
responsible for tracking which pgmap_offsets are requested and which
have been returned by free. It would be nice to say that this can only
be called on free'd folios. (A rough wording sketch follows below the
signature hunk.)

> -bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> -		int nr_folios)
> +struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
> +		pgoff_t pgmap_offset, int order)

unsigned int order?
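For the kernel-doc, maybe something along these lines. The wording is
only a sketch of the points above; the page_free() naming follows the
existing struct dev_pagemap_ops callback:

/**
 * pgmap_request_folio - activate a folio of a given order in @pgmap
 * @pgmap: host page map of the folio to activate
 * @pgmap_offset: page-offset into the pgmap to request
 * @order: expected folio_order() of the folio
 *
 * Returns the requested folio with an elevated reference count. Once
 * the folio's reference count drops back to zero,
 * pgmap_ops->page_free() is called and the folio must be re-requested
 * before it can be used again; only free'd folios may be requested
 * here. The pgmap user is responsible for tracking which pgmap_offsets
 * are requested and which have been returned by free.
 *
 * Caller is responsible for @pgmap remaining live for the duration of
 * this call, and for not racing requests for the same @pgmap_offset.
 */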
> {
> -	struct folio *iter;
> -	int i;
> +	unsigned long pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
> +	struct page *page = pfn_to_page(pfn);
> +	struct folio *folio;
> +	int v;
>
> -	/*
> -	 * All of the WARNs below are for catching bugs in future
> -	 * development that changes the assumptions of:
> -	 * 1/ uniform folios in @pgmap
> -	 * 2/ @pgmap death does not race this routine.
> -	 */
> -	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +	if (WARN_ON_ONCE(page->pgmap != pgmap))
> +		return NULL;

Checking that pgmap_offset is not bigger than the pgmap length is also
a good assertion.. At that point, if the pgmap is not right then the
struct page has been corrupted.

>
>  	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> -		return false;
> +		return NULL;
>
> -	for (iter = folio_next(folio), i = 1; i < nr_folios;
> -	     iter = folio_next(folio), i++)
> -		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> -			return false;
> +	folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_order(folio) != order))
> +		return NULL;

Do you see a blocker to simply restructuring the pages into head/tail
here? If the refcounts are all zero it should be safe?

> +	v = folio_ref_inc_return(folio);
> +	if (v > 1)
> +		return folio;

IMHO, ideally, this should require the folio to have a 0 refcount, and
this should set it to 1.

> +	if (WARN_ON_ONCE(!percpu_ref_tryget(&pgmap->ref))) {

This should not be a WARN_ON; there can be races where the dying check
misses but the refcounts all reached zero anyhow.
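To pull those together, an untested sketch of the request path with
the bounds check, the 0 -> 1 refcount requirement, and a quiet tryget
failure folded in. pgmap_pfn_valid() is assumed usable for the bounds
check here, and the non-atomic refcount check leans on the caller-side
offset tracking above to keep requests for the same offset from
racing:

struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
		pgoff_t pgmap_offset, unsigned int order)
{
	unsigned long pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
	struct page *page;
	struct folio *folio;

	/* reject offsets that fall outside the pgmap's ranges */
	if (WARN_ON_ONCE(!pgmap_pfn_valid(pgmap, pfn)))
		return NULL;

	page = pfn_to_page(pfn);
	if (WARN_ON_ONCE(page->pgmap != pgmap))
		return NULL;

	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
		return NULL;

	folio = page_folio(page);
	if (WARN_ON_ONCE(folio_order(folio) != order))
		return NULL;

	/* only a free'd (idle) folio may be handed out: require 0 -> 1 */
	if (WARN_ON_ONCE(folio_ref_count(folio) != 0))
		return NULL;
	folio_set_count(folio, 1);

	/*
	 * The dying check above can race with the final put dropping
	 * pgmap->ref; losing the tryget here is a legitimate outcome,
	 * not a bug, so no WARN.
	 */
	if (!percpu_ref_tryget(&pgmap->ref)) {
		folio_set_count(folio, 0);
		return NULL;
	}
	return folio;
}

Jason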