Jason Gunthorpe wrote: > On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote: > > > > /** > > > > * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn > > > > * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death > > > > * @pfn: page frame number covered by @pgmap > > > > */ > > > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, > > > > unsigned long pfn) > > Maybe should be not be pfn but be 'offset from the first page of the > pgmap' ? Then we don't need the xa_load stuff, since it cann't be > wrong by definition. > > > > > { > > > > struct page *page; > > > > > > > > VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys))); > > > > > > > > if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref))) > > > > return NULL; > > > > > > This shouldn't be a WARN? > > > > It's a bug if someone calls this after killing the pgmap. I.e. the > > expectation is that the caller is synchronzing this. The only reason > > this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do > > not expect it to fire on anything but a development kernel. > > OK, that makes sense > > But shouldn't this get the pgmap refcount here? The reason we started > talking about this was to make all the pgmap logic self contained so > that the pgmap doesn't pass its own destroy until all the all the > page_free()'s have been done. > > > > > This does not create compound folios, that needs to be coordinated with > > > > the caller and likely needs an explicit > > > > > > Does it? What situations do you think the caller needs to coordinate > > > the folio size? Caller should call the function for each logical unit > > > of storage it wants to allocate from the pgmap.. > > > > The problem for fsdax is that it needs to gather all the PTEs, hold a > > lock to synchronize against events that would shatter a huge page, and > > then build up the compound folio metadata before inserting the PMD. > > Er, at this point we are just talking about acquiring virgin pages > nobody else is using, not inserting things. There is no possibility of > conurrent shattering because, by definition, nothing else can > reference these struct pages at this instant. > > Also, the caller must already be serializating pgmap_get_folio() > against concurrent calls on the same pfn (since it is an error to call > pgmap_get_folio() on an non-free pfn) > > So, I would expect the caller must already have all the necessary > locking to accept maximally sized folios. > > eg if it has some reason to punch a hole in the contiguous range > (shatter the folio) it must *already* serialize against > pgmap_get_folio(), since something like punching a hole must know with > certainty if any struct pages are refcount != 0 or not, and must not > race with something trying to set their refcount to 1. Perhaps, I'll take a look. The scenario I am more concerned about is processA sets up a VMA of PAGE_SIZE and races processB to fault in the same filesystem block with a VMA of PMD_SIZE. Right now processA gets a PTE mapping and processB gets a PMD mapping, but the refcounting is all handled in small pages. I need to investigate more what is needed for fsdax to support folio_size() > mapping entry size.