Dan Williams <dan.j.williams@xxxxxxxxx> writes:

> Jason Gunthorpe wrote:
>> On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
>> > > > /**
>> > > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
>> > > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
>> > > >  * @pfn: page frame number covered by @pgmap
>> > > >  */
>> > > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
>> > > > 			      unsigned long pfn)
>>
>> Maybe this should not be a pfn but an 'offset from the first page of
>> the pgmap'? Then we don't need the xa_load stuff, since it can't be
>> wrong by definition.
>>
>> > > > {
>> > > > 	struct page *page;
>> > > >
>> > > > 	VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, pfn),
>> > > > 		     "pfn not covered by @pgmap\n");
>> > > >
>> > > > 	if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref),
>> > > > 		      "pgmap is being killed\n"))
>> > > > 		return NULL;
>> > >
>> > > This shouldn't be a WARN?
>> >
>> > It's a bug if someone calls this after killing the pgmap. I.e. the
>> > expectation is that the caller is synchronizing this. The only
>> > reason this isn't a VM_WARN_ONCE is because the sanity check is
>> > cheap, but I do not expect it to fire on anything but a development
>> > kernel.
>>
>> OK, that makes sense.
>>
>> But shouldn't this get the pgmap refcount here? The reason we started
>> talking about this was to make all the pgmap logic self-contained so
>> that the pgmap doesn't pass its own destroy until all the
>> page_free()'s have been done.

That sounds good to me at least. I just noticed we introduced this exact
bug for device private/coherent pages when making their refcounts zero
based. Nothing currently takes pgmap->ref when a private/coherent page
is mapped, so memunmap_pages() can complete and the pgmap be destroyed
while its pages are still mapped.

So I think we need to call put_dev_pagemap() as part of
free_zone_device_page().
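Roughly like this (an untested sketch; most of free_zone_device_page()'s
existing body is elided, and the matching reference would need to be
taken when the page is first handed out, e.g. in a hypothetical
zone_device_page_init() style helper):

void free_zone_device_page(struct page *page)
{
	struct dev_pagemap *pgmap = page->pgmap;

	if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
		return;

	/* ... existing anon/mapping cleanup elided ... */

	page->mapping = NULL;
	pgmap->ops->page_free(page);

	/*
	 * Drop the reference taken when the page was first mapped, so
	 * that memunmap_pages() cannot complete (and the pgmap be
	 * destroyed) while a private/coherent page is still mapped.
	 */
	put_dev_pagemap(pgmap);
}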
- Alistair

>> > > > This does not create compound folios, that needs to be
>> > > > coordinated with the caller and likely needs an explicit
>> > >
>> > > Does it? What situations do you think the caller needs to
>> > > coordinate the folio size? The caller should call the function for
>> > > each logical unit of storage it wants to allocate from the pgmap.
>> >
>> > The problem for fsdax is that it needs to gather all the PTEs, hold
>> > a lock to synchronize against events that would shatter a huge
>> > page, and then build up the compound folio metadata before
>> > inserting the PMD.
>>
>> Er, at this point we are just talking about acquiring virgin pages
>> nobody else is using, not inserting things. There is no possibility
>> of concurrent shattering because, by definition, nothing else can
>> reference these struct pages at this instant.
>>
>> Also, the caller must already be serializing pgmap_get_folio()
>> against concurrent calls on the same pfn (since it is an error to
>> call pgmap_get_folio() on a non-free pfn).
>>
>> So, I would expect the caller must already have all the necessary
>> locking to accept maximally sized folios.
>>
>> eg if it has some reason to punch a hole in the contiguous range
>> (shatter the folio) it must *already* serialize against
>> pgmap_get_folio(), since something like punching a hole must know
>> with certainty whether any struct pages have refcount != 0, and must
>> not race with something trying to set their refcount to 1.
>
> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA setting up a VMA of PAGE_SIZE and racing processB to fault in
> the same filesystem block with a VMA of PMD_SIZE. Right now processA
> gets a PTE mapping and processB gets a PMD mapping, but the
> refcounting is all handled in small pages. I need to investigate
> further what fsdax needs in order to support folio_size() > mapping
> entry size.
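Coming back to Jason's point above about taking the pgmap reference in
pgmap_get_folio() itself, I'd imagine something roughly like the sketch
below. Untested, and whether a plain percpu_ref_get() is enough here,
versus something like percpu_ref_tryget_live(), is exactly the
caller-synchronization question being debated:

struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
			      unsigned long pfn)
{
	struct page *page;

	if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref),
		      "caller failed to synchronize against pgmap death\n"))
		return NULL;

	/* Hold the pgmap until the folio is freed back to it */
	percpu_ref_get(&pgmap->ref);

	page = pfn_to_page(pfn);
	return page_folio(page);
}

The matching put would then be the put_dev_pagemap() added to
free_zone_device_page() in the sketch earlier in this mail.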