Dan Williams <dan.j.williams@xxxxxxxxx> writes: > Jason Gunthorpe wrote: >> On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote: >> > The initial memremap_pages() implementation inherited the >> > __init_single_page() default of pages starting life with an elevated >> > reference count. This originally allowed for the page->pgmap pointer to >> > alias with the storage for page->lru since a page was only allowed to be >> > on an lru list when its reference count was zero. >> > >> > Since then, 'struct page' definition cleanups have arranged for >> > dedicated space for the ZONE_DEVICE page metadata, and the >> > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0 >> > page->_refcount transition to route the page to free_zone_device_page() >> > and not the core-mm page-free. With those cleanups in place and with >> > filesystem-dax and device-dax now converted to take and drop references >> > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX >> > and MEMORY_DEVICE_GENERIC reference counts at 0. >> > >> > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE >> > pages start life at _refcount 1, so make that the default if >> > pgmap->init_mode is left at zero. >> >> I'm shocked to read this - how does it make any sense? > > I think what happened is that since memremap_pages() historically > produced pages with an elevated reference count that GPU drivers skipped > taking a reference on first allocation and just passed along an elevated > reference count page to the first user. > > So either we keep that assumption or update all users to be prepared for > idle pages coming out of memremap_pages(). > > This is all in reaction to the "set_page_count(page, 1);" in > free_zone_device_page(). Which I am happy to get rid of but need from > help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to > memremap_pages() starting all pages at reference count 0. This is all rather good timing - This week I've been in the middle of getting a series together which fixes this among other things. So I'm all for fixing it and can help with that - my motivation was I needed to be able to tell if a page is free or not with get_page_unless_zero() etc. which doesn't work at the moment because free device private/coherent pages have an elevated refcount. - Alistair >> dev_pagemap_ops->page_free() is only called on the 1->0 transition, so >> any driver which implements it must be expecting pages to have a 0 >> refcount. >> >> Looking around everything but only fsdax_pagemap_ops implements >> page_free() > > Right. > >> So, how does it work? Surely the instant the page map is created all >> the pages must be considered 'free', and after page_free() is called I >> would also expect the page to be considered free. > > The GPU drivers need to increment reference counts when they hand out > the page rather than reuse the reference count that they get by default. > >> How on earth can a free'd page have both a 0 and 1 refcount?? > > This is residual wonkiness from memremap_pages() handing out pages with > elevated reference counts at the outset. > >> eg look at the simple hmm_test, it threads pages on to the >> mdevice->free_pages list immediately after memremap_pages and then >> again inside page_free() - it is completely wrong that they would have >> different refcounts while on the free_pages list. > > I do not see any page_ref_inc() in that test, only put_page() so it is > assuming non-idle pages at the outset. > >> I would expect that after the page is removed from the free_pages list >> it will have its recount set to 1 to make it non-free then it will go >> through the migration. >> >> Alistair how should the refcounting be working here in hmm_test? >> >> Jason