Changes since v6[10]:

* Patch 4, Wrap commit message to 73 characters max (Christoph Hellwig)
* Patch 4, Move pfn_next() in for_each_device_pfn() to new line
  (Christoph Hellwig)
* Patch 4, Move pfn range computation to a pfn_len() helper.
  (Christoph Hellwig)
* Patch 9, Remove @fault_size as it's no longer used (also reported by
  kbuild robot).
* New Patch 10, remove unneeded @pfn output parameter from
  dev_dax_huge_fault() (Christoph Hellwig) -- this is done in a new
  patch "device-dax: remove pfn from __dev_dax_{pte,pmd,pud}_fault()"

Series is meant to replace what's merged in mmotm/linux-next. Only patch
4 has changed, but I added a new cleanup patch suggested by Christoph,
which is what prompted me to resend the entire series. This is based on
linux-next tag next-20211124 (commit 4b74e088fef6), the same base as the
v6 it replaces. Let me know if there's another preferred way of doing
this (e.g. sending patch 10 separately as a follow-up and just picking
up patch 4 from this series, as mmotm already has the patch 9 fix).

---

This series converts device-dax to use compound pages, and moves away
from the 'struct page per basepage on PMD/PUD' that is done today.
Doing so 1) unlocks a few noticeable improvements on unpin_user_pages()
and makes the device-dax+altmap case 4x faster in pinning (numbers below
and in last patch), and 2) as mentioned in various other threads, is one
important step towards cleaning up ZONE_DEVICE refcounting.

I've split the compound pages on devmap part from the rest based on
recent discussions on devmap pending and future work planned[5][6].
There is consensus that device-dax should be using compound pages to
represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
specialization of the dax parts. I will pursue the rest of the work in
parallel once this part is merged, in particular the GUP-{slow,fast}
improvements [7] and the tail struct page deduplication memory savings
part[8].

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). This is used
when initializing zone device when we bring up device-dax namespaces.

Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initializes its metadata as compound pages,
and it introduces a new devmap property known as vmemmap_shift which
outlines how the vmemmap is structured (defaults to base pages as done
today). The property essentially describes the page order of the
metadata. While at it do a few cleanups in device-dax in patches 5-9.
Finally enable device-dax usage of devmap @vmemmap_shift to a value
based on its own @align property. @vmemmap_shift is 0 by default (which
is today's case of base pages in devmap, like fsdax or the others) and
the usage of compound devmap is optional. Starting with device-dax
(*not* fsdax) we enable it by default. There are a few pinning
improvements, particularly on the unpinning case and altmap, as well as
unpin_user_page_range_dirty_lock() being just as effective as
THP/hugetlb[0] pages.

    $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
    [altmap]
    (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms

    $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
    [altmap with -m 127004]
    (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms

Tested on x86 with 1TB+ of pmem (alongside registering it with RDMA with
and without altmap), alongside gup_test selftests with dynamic dax
regions and static dax regions. Coupled with ndctl unit tests for
dynamic dax devices that exercise all of this.
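
For readers unfamiliar with @vmemmap_shift, the relation between a
device-dax @align and the resulting page order can be modelled in plain
userspace C. This is only a sketch and not code from the series: the
helper names and MODEL_PAGE_SHIFT are hypothetical, it assumes 4K base
pages, and it assumes the shift is simply the order of @align expressed
in base pages.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K base pages, as on x86-64. */
#define MODEL_PAGE_SHIFT 12u

/*
 * Hypothetical model (not a kernel function): page order implied by a
 * device-dax @align given in bytes. A base-page @align yields 0, which
 * matches the default of no compound pages.
 */
static unsigned int model_vmemmap_shift(uint64_t align)
{
	unsigned int shift = 0;

	/* @align must be a power of two no smaller than a base page. */
	assert(align >= (UINT64_C(1) << MODEL_PAGE_SHIFT));
	assert((align & (align - 1)) == 0);

	while ((UINT64_C(1) << (MODEL_PAGE_SHIFT + shift)) < align)
		shift++;

	return shift;
}

/* Number of base-page PFNs covered by one compound page of that order. */
static uint64_t model_pfns_per_compound(unsigned int vmemmap_shift)
{
	return UINT64_C(1) << vmemmap_shift;
}
```

Under these assumptions a 2M @align gives order 9 (512 PFNs per compound
page) and a 1G @align gives order 18 (262144 PFNs per compound page).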

Note, for dynamic dax regions I had to revert commit 8aa83e6395
("x86/setup: Call early_reserve_memory() earlier"); it is a known issue
that this commit broke efi_fake_mem=.

Patches apply on top of linux-next tag next-20211124 (commit
4b74e088fef6).

Thanks for all the review so far.

As always, comments and suggestions are very much appreciated!

Older Changelog,

v5[9] -> v6[10]:

* Keep @dev on the previous line to improve readability on patch 5
  (Christoph Hellwig)
* Document is_static() function to clarify what are static and dynamic
  dax regions in patch 7 (Christoph Hellwig)
* Deduce @f_mapping and @pgmap from vmf->vma->vm_file to reduce the
  number of arguments of set_{page,compound}_mapping() in last patch
  (Christoph Hellwig)
* Factor out @mapping initialization to a separate helper ([new] patch
  8) and rename set_page_mapping() to dax_set_mapping() in the process.
* Remove set_compound_mapping() and instead adjust dax_set_mapping() to
  handle @vmemmap_shift case on the last patch. This greatly simplifies
  the last patch, and addresses a similar comment by Christoph on having
  an earlier return. No functional change on the changes to
  dax_set_mapping compared to its earlier version so I retained Dan's Rb
  on last patch.
* Initialize the mapping prior to inserting the PTE/PMD/PUD as opposed
  to after the fact. ([new] patch 9, Jason Gunthorpe)

Patches 8 and 9 are new (small cleanups) in v6.
Patches 6 - 9 are the ones missing Rb tags.

v4[4] -> v5[9]:

* Remove patches 8-14 as they will go in 2 separate (parallel) series;
* Rename @geometry to @vmemmap_shift (Christoph Hellwig)
* Make @vmemmap_shift an order rather than nr of pages (Christoph
  Hellwig)
* Consequently remove helper pgmap_geometry_order() as it's no longer
  needed, in favour of accessing the structure member directly [Patch 4
  and 8]
* Rename pgmap_geometry() to pgmap_vmemmap_nr() in patches 4 and 8;
* Remove usage of pgmap_geometry() in favour of testing @vmemmap_shift
  for non-zero directly in patch 8;
* Patch 5 is new for using `struct_size()` (Dan Williams)
* Add a 'static_dev_dax()' helper for testing pgmap == NULL handling for
  dynamic dax devices.
* Expand patch 6 to be explicitly on those !pgmap cases, and replace
  those with static_dev_dax().
* Add gup/pin_user_pages() performance numbers with this series to
  patch 8.
* Massage commit description to remove mentions of @geometry.
* Add Dan's Reviewed-by on patch 8 (Dan Williams)

v3[3] -> v4[4]:

* Collect Dan's Reviewed-by on patches 1-5,8,9,11
* Collect Muchun Reviewed-by on patch 1,2,11
* Reorder patches to first introduce compound pages in ZONE_DEVICE with
  device-dax (for pmem) as first user (patches 1-8) followed by
  implementing the sparse-vmemmap changes to minimize struct page
  overhead for devmap (patches 9-14)
* Eliminate remnant @align references to use @geometry (Dan)
* Convert mentions of 'compound pagemap' to 'compound devmap' throughout
  the series to avoid confusion of this work conflicting/referring to
  anything Folio or pagemap related.
* Delete pgmap_pfn_geometry() on patch 4 and rework other patches to use
  pgmap_geometry() instead (Dan)
* Convert @geometry to be a number of pages rather than page size in
  patch 4 (Dan)
* Make pgmap_geometry() more readable (Christoph)
* Simplify pgmap refcount pfn computation in memremap_pages()
  (Christoph)
* Rework memmap_init_compound() in patch 4 to use the same style as
  memmap_init_zone_device i.e. iterating over PFNs, rather than struct
  pages (Dan)
* Add comment on devmap prep_compound_head callsite explaining why it
  needs to be used after first+second tail pages have been initialized
  (Dan, Jane)
* Initialize tail page refcount to zero in patch 4
* Make sure pfn_next() iterates over compound pages (rather than base
  pages) in patch 4 to tackle the zone_device elevated page refcount.
  [ Note these last two bullet points above are unneeded once this patch
  is merged:
  https://lore.kernel.org/linux-mm/20210825034828.12927-3-alex.sierra@xxxxxxx/ ]
* Remove usage of ternary operator when computing @end in
  gup_device_huge() in patch 8 (Dan)
* Remove pinned_head variable in patch 8
* Remove put_dev_pagemap() need for compound case as that is now fixed
  for the general case in patch 8
* Switch to PageHead() instead of PageCompound() as we only work with
  either base pages or head pages in patch 8 (Matthew)
* Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
* Fix up missing return in vmemmap_populate_address() in patch 10
* Change error handling style in all patches (Dan)
* Change title of vmemmap_dedup.rst to be more representative of the
  purpose in patch 12 (Dan)
* Move some of the section and subsection tail page reuse code into
  helpers reuse_compound_section() and compound_section_tail_page() for
  readability in patch 12 (Dan)
* Commit description fixes for clarity in various patches (Dan)
* Add pgmap_geometry_order() helper and drop unneeded geometry_size,
  order variables in patch 12
* Drop unneeded byte based computation to be PFN in patch 12
* Handle the dynamic dax region properly when ensuring a stable
  dev_dax->pgmap in patch 6.
* Add a compound_nr_pages() helper and use it in memmap_init_zone_device
  to calculate the number of unique struct pages to initialize depending
  on @altmap existence in patch 13 (Dan)
* Add compound_section_tail_huge_page() for the tail page PMD reuse in
  patch 14 (Dan)
* Reword cover letter.

v2 -> v3[3]:

* Collect Mike's Ack on patch 2 (Mike)
* Collect Naoya's Reviewed-by on patch 1 (Naoya)
* Rename compound_pagemaps.rst doc page (and its mentions) to
  vmemmap_dedup.rst (Mike, Muchun)
* Rebased to next-20210714

v1[1] -> v2[2]:

(New patches 7, 10, 11)
* Remove occurrences of 'we' in the commit descriptions (now for real)
  [Dan]
* Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
* Massage commit descriptions of cleanup/refactor patches to reflect
  that it's in preparation for bigger infra in sparse-vmemmap. (Patch
  2,3,5) [Dan]
* Greatly improve all commit messages in terms of grammar/wording and
  clarity. [Dan]
* Rename variable/helpers from dev_pagemap::align to @geometry,
  reflecting that it's not the same thing as dev_dax->align, Patch 4
  [Dan]
* Move compound page init logic into separate memmap_init_compound()
  helper, Patch 4 [Dan]
* Simplify patch 9 as a result of having compound initialization
  differently [Dan]
* Rename @pfn_align variable in memmap_init_zone_device to
  @pfns_per_compound [Dan]
* Rename Subject of patch 6 [Dan]
* Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
* Add some type-safety to @block and use 'struct page *' rather than
  void, Patch 8 [Dan]
* Add some comments to less obvious parts on 1G compound page case,
  Patch 8 [Dan]
* Remove vmemmap lookup function in place of pmd_off_k() +
  pte_offset_kernel() given some guarantees on section onlining
  serialization, Patch 8
* Add a comment to get_page() mentioning where/how it is freed, Patch 8
  [Dan]
* Add docs about device-dax usage of tail dedup technique in newly added
  compound_pagemaps.rst doc entry.
* Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always
  set [Dan]
* Add cleanup patch for device-dax for using ALIGN() [Dan]
* Store pinned head in separate @pinned_head variable and fix error
  case, patch 13 [Dan]
* Add comment on difference of @next value for PageCompound(), patch 13
  [Dan]
* Move PUD compound page to be last patch [Dan]
* Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst
  doc, patch 14 [Dan]
* Rebased to next-20210617

RFC[0] -> v1:

(New patches 1-3, 5-8 but the diffstat isn't that different)
* Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in
  v1)
* Fix/Massage commit messages to be more clear and remove the 'we'
  occurrences (Dan, John, Matthew)
* Use pfn_align to be clear it's nr of pages for @align value (John,
  Dan)
* Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of
  pgmap->align;
* Remove the gup_device_compound_huge special path and have the same
  code work both ways while special casing when devmap page is compound
  (Jason, John)
* Avoid usage of vmemmap_populate_basepages() and introduce a first
  class loop that doesn't care about passing an altmap for memmap reuse.
  (Dan)
* Completely rework the vmemmap_populate_compound() to avoid the
  sparse_add_section hack into passing block across sparse_add_section
  calls. It's a lot easier to follow and more explicit in what it does.
* Replace the vmemmap refactoring with adding a @pgmap argument and
  moving parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are
  new as a result)
* Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
* Improve memmap_init_zone_device() to initialize compound pages when
  struct pages are cache warm. That led to an even further speedup over
  the RFC series, from 190ms -> 80-120ms.
  Patches 2 and 3 are the new ones as a result (Dan)
* Remove PGMAP_COMPOUND and use @align as the property to detect whether
  or not to reuse vmemmap areas (Dan)

[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@xxxxxxxxxx/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@xxxxxxxxxx/
[3] https://lore.kernel.org/linux-mm/20210714193542.21857-1-joao.m.martins@xxxxxxxxxx/
[4] https://lore.kernel.org/linux-mm/20210827145819.16471-1-joao.m.martins@xxxxxxxxxx/
[5] https://lore.kernel.org/linux-mm/20211018182559.GC3686969@xxxxxxxx/
[6] https://lore.kernel.org/linux-mm/499043a0-b3d8-7a42-4aee-84b81f5b633f@xxxxxxxxxx/
[7] https://lore.kernel.org/linux-mm/20210827145819.16471-9-joao.m.martins@xxxxxxxxxx/
[8] https://lore.kernel.org/linux-mm/20210827145819.16471-13-joao.m.martins@xxxxxxxxxx/
[9] https://lore.kernel.org/linux-mm/20211112150824.11028-1-joao.m.martins@xxxxxxxxxx/
[10] https://lore.kernel.org/linux-mm/20211124191005.20783-1-joao.m.martins@xxxxxxxxxx/

Joao Martins (11):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  device-dax: use ALIGN() for determining pgoff
  device-dax: use struct_size()
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: factor out page mapping initialization
  device-dax: set mapping prior to vmf_insert_pfn{,_pmd,pud}()
  device-dax: remove pfn from __dev_dax_{pte,pmd,pud}_fault()
  device-dax: compound devmap support

 drivers/dax/bus.c        |  32 +++++++++
 drivers/dax/bus.h        |   1 +
 drivers/dax/device.c     | 124 +++++++++++++++++++++--------------
 include/linux/memremap.h |  11 ++++
 mm/memory-failure.c      |   6 ++
 mm/memremap.c            |  18 +++--
 mm/page_alloc.c          | 138 +++++++++++++++++++++++++++------------
 7 files changed, 233 insertions(+), 97 deletions(-)

--
2.17.2