Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell: > > On 6/17/21 8:16 AM, Alex Sierra wrote: >> From: Ralph Campbell <rcampbell@xxxxxxxxxx> >> >> ZONE_DEVICE struct pages have an extra reference count that >> complicates the >> code for put_page() and several places in the kernel that need to >> check the >> reference count to see that a page is not being used (gup, compaction, >> migration, etc.). Clean up the code so the reference count doesn't >> need to >> be treated specially for ZONE_DEVICE. >> >> v2: >> AS: merged this patch in linux 5.11 version >> >> Signed-off-by: Ralph Campbell <rcampbell@xxxxxxxxxx> >> Signed-off-by: Alex Sierra <alex.sierra@xxxxxxx> >> --- >> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- >> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- >> fs/dax.c | 4 +- >> include/linux/dax.h | 2 +- >> include/linux/memremap.h | 7 +-- >> include/linux/mm.h | 44 ----------------- >> lib/test_hmm.c | 2 +- >> mm/internal.h | 8 +++ >> mm/memremap.c | 68 +++++++------------------- >> mm/migrate.c | 5 -- >> mm/page_alloc.c | 3 ++ >> mm/swap.c | 45 ++--------------- >> 12 files changed, 45 insertions(+), 147 deletions(-) >> > I think it is great that you are picking this up and trying to revive it. > > However, I have a number of concerns about how it affects existing > ZONE_DEVICE > MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this > addressing them. For example, dev_dax_probe() allocates > MEMORY_DEVICE_GENERIC > struct pages and then: > dev_dax_fault() > dev_dax_huge_fault() > __dev_dax_pte_fault() > vmf_insert_mixed() > which just inserts the PFN into the CPU page tables without increasing > the page > refcount so it is zero (whereas it was one before). But using > get_page() will > trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current > notion of > free verses allocated for these struct pages. I suppose init_page_count() > could be called on all the struct pages in dev_dax_probe() to fix that > though. Hi Ralph, For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So I'm not sure what the reference counting is good for in this case. Alex is going to add free_zone_device_page support for DEVICE_GENERIC pages (patch 8 of this series). However, even then, it only does anything if there is an actual call to put_page. Where would that call come from in the dev_dax driver case? > > I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File > systems have clear > allocate and free states for backing storage but there are the > complications with > the page cache references, etc. to consider. The >1 to 1 reference > count seems to > be used to tell when a page is idle (no I/O, reclaim scanners) rather > than free > (not allocated to any file) but I'm not 100% sure about that since I > don't really > understand all the issues around why a file system needs to have a DAX > mount option > besides knowing that the storage block size has to be a multiple of > the page size. The only thing that happens in free_zone_device_page for FS_DAX pages is wake_up_var(&page->_refcount). I guess, whoever is waiting for this wake-up will need to be prepared to see a refcount 0 instead of 1 now. I see these callers that would need to be updated: ./fs/ext4/inode.c: error = ___wait_var_event(&page->_refcount, ./fs/ext4/inode.c- atomic_read(&page->_refcount) == 1, ./fs/ext4/inode.c- TASK_INTERRUPTIBLE, 0, 0, ./fs/ext4/inode.c- ext4_wait_dax_page(ei)); -- ./fs/fuse/dax.c: return ___wait_var_event(&page->_refcount, ./fs/fuse/dax.c- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, ./fs/fuse/dax.c- 0, 0, fuse_wait_dax_page(inode)); -- ./fs/xfs/xfs_file.c: return ___wait_var_event(&page->_refcount, ./fs/xfs/xfs_file.c- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, ./fs/xfs/xfs_file.c- 0, 0, xfs_wait_dax_page(inode)); Regarding your page-cache comment, doesn't DAX mean that those file pages are not in the page cache? Regards, Felix