Hi, This series try to add support for high order folios in shmem write and fallocate paths when swap is disabled (noswap option). This is part of the Large Block Size (LBS) effort [1][2] and a continuation of the shmem work from Luis here [3] following Matthew Wilcox's suggestion [4] regarding the path to take for the folio allocation order calculation. [1] https://kernelnewbies.org/KernelProjects/large-block-size [2] https://docs.google.com/spreadsheets/d/e/2PACX-1vS7sQfw90S00l2rfOKm83Jlg0px8KxMQE4HHp_DKRGbAGcAV-xu6LITHBEc4xzVh9wLH6WM2lR0cZS8/pubhtml# [3] RFC v2 add support for blocksize > PAGE_SIZE https://lore.kernel.org/all/ZHBowMEDfyrAAOWH@xxxxxxxxxxxxxxxxxxxxxx/T/#md3e93ab46ce2ad9254e1eb54ffe71211988b5632 [4] https://lore.kernel.org/all/ZHD9zmIeNXICDaRJ@xxxxxxxxxxxxxxxxxxxx/ I went from the latest v2 to an RFC because my current implementation is broken the moment large folios is enabled in the fallocate path. So this is a work in progress RFC patch series (and therefore incomplete). The issue was identified when running this series on fstests for tmpfs and fail on generic/285 and generic/436 tests (lseek DATA/HOLE) [5][6] with large folios support in the fallocate path. To fix these regressions I try adding support for per-block tracking of the uptodate flag instead of doing it per folio. I borrowed this implementation from iomap but I may not integrated it correctly. I think this was introduced in iomap few years back to address the problem when block size < PS [7], and recently being optimized [8] and added the ability of tracking per-block dirty flag. With large folios, per-block (page) uptodate is needed, otherwise the entire large folio is marked as uptodate in shmem_write_end() making the lseek HOLE and DATA tests fail (above tests). These per-block uptodate tracking support [9] fixes the above mentioned generic tests but introduces new errors easily reproducible with fsx [10]. In addition, this other thread [11] explains the performance problem with XFS for high order folios in iomap write path but I think here we need at least the uptodate only because of the above reasoning. Please, find below the logs [5][6] for lseek DATA/HOLE before and after the fixes. And the logs [10] for the fsx failure. I'm looking forward for your comments for error correction and to determine the overall vailidty of the approach. Note: In case people are interested in testing, we've added testing support for tmpfs in kdevops using (x)fstests. Please find the link to the baseline results in the below changes section. Available profiles are: * default (no mount options) * huge=always * huge=within_size * huge=advise * noswap, huge=never * noswap, huge=always * noswap, huge=within_size * noswap, huge=advise Changes since v2 * Rebased onto next-20231027 including latests changes for shmem and mempolicy. * Testing tmpfs using fstests with kdevops. Baseline results for different linux-next tags can be found here: https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.6.0-rc6-next-20231019/tmpfs/unassigned https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.6.0-rc4-next-20231006/tmpfs/unassigned https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges/6.6.0-rc4-next-20231004/tmpfs/unassigned * Added XArray tests to prove order is not kept when replacing an entry with NULL when using cmpxchg. Required for patch 'shmem: return number of pages beeing freed in shmem_free_swap' * Added XArray test for multi-index use. * Drop huge argument in shmem_alloc_and_add_folio() and make use of VM_HUGEPAGE instead. * Increase max order from PMD_ORDER-1 to PMD_ORDER (MAX_PAGECACHE_ORDER). * Add/fix shmem_free_swap conversion to return (properly) the number of pages freed. * Fix order patch being changed in further patch. However, I do initialize order = 0 in patch [patch-order] and then updated to the mapping size in patch [patch-high-order]. I can merge both patches if necessary to avoid this change in the series. [patch-order]: shmem: add order arg to shmem_alloc_folio() [patch-high-order]: shmem: add large folio support to the write path * Folio order tracing when added to page cache. * THP vs large folios in the write path: if huge flag is passed and kernel has support for THP, then allocation will use huge the path, otherwise folio order will be used, based on the file size without using huge_gfp flags. * Add patch to remove huge flag argument from shmem_alloc_and_add_folio. We can check for the huge flag being set as part of gfp flags (VM_HUGEPAGE). Check patch: 'shmem: remove huge arg from shmem_alloc_and_add_folio()'. * Add high order folios in fallocate path. * Add per-block uptodate tracking based on iomap implementation (work in progress). Changes since v1 * Order handling code simplified in shmem_get_folio_gfp after Matthew Willcox's review. * Drop patch 1/6 [filemap] and merge mapping_size_order code directly in shmem. [filemap] filemap: make the folio order calculation shareable * Added MAX_SHMEM_ORDER to make it explicit we don't have the same max order as in pagecache (MAX_PAGECACHE_ORDER). * Use HPAGE_PMD_ORDER-1 as MAX_SHMEM_ORDER to respect huge mount option. * Update cover letter: drop huge strategy question and add more context regarding LBS project. Add fsx and fstests summary with new baseline. * Add fixes found by Matthew in patch 3/6 [acct]. [acct] shmem: account for large order folios * Fix length (i_size_read -> PAGE_SIZE) that is passed to shmem_get_folio_gfp in shmem_fault and shmem_read_folio_gfp to PAGE_SIZE. * Add patch as suggested by Matthew to return the number of pages freed in shmem_free_swap (instead of errno). When no pages are freed, return 0 (pages). Note: As an alternative, we can embed -ENOENT and make use of IS_ERR_VALUE. Approach discarded because little value was added. If this method is preferred, please let discuss it. [5] (x)ftests regression with large folios in the fallocate path: generic/285: src/seek_sanity_test/test09() generic/436: src/seek_sanity_test/test13() [6] (x)ftests, how to check/reproduce regressions: ```sh mkdir -p /mnt/test-tmpfs ./src/seek_sanity_test -s 9 -e 9 /mnt/test-tmpfs/file ./src/seek_sanity_test -s 13 -e 13 /mnt/test-tmpfs/file umount /mnt/test-tmpfs ``` [7] iomap per-block uptodate tracking in iomap: 9dc55f1389f9 iomap: add support for sub-pagesize buffered I/O without buffer heads 1cea335d1db1 iomap: fix sub-page uptodate handling [8] iomap per-block dirty and uptodate flags optimizations in iomap 4ce02c679722 iomap: Add per-block dirty state tracking to improve performance 35d30c9cf127 iomap: don't skip reading in !uptodate folios when unsharing a range a01b8f225248 iomap: Allocate ifs in ->write_begin() early 7f79d85b525b iomap: Refactor iomap_write_delalloc_punch() function out 0af2b37d8e7a iomap: Use iomap_punch_t typedef eee2d2e6ea55 iomap: Fix possible overflow condition in iomap_write_delalloc_scan cc86181a3b76 iomap: Add some uptodate state handling helpers for ifs state bitmap 3ea5c76cadee iomap: Drop ifs argument from iomap_set_range_uptodate() 04f52c4e6f80 iomap: Rename iomap_page to iomap_folio_state and others [9] Patch: shmem: add per-block uptodate tracking [10] fsx up to 633 ops (or up to 1200 without -X). ```sh mkdir -p /mnt/test-tmpfs mount -t tmpfs -o size=1G -o noswap tmpfs /mnt/test-tmpfs /root/xfstests-dev/ltp/fsx /mnt/test-tmpfs/file -d -N 1200 -X umount /mnt/test-tmpfs ``` Logs: ```logs READ BAD DATA: offset = 0x0, size = 0x3364c, fname = /mnt/test-tmpfs/file OFFSET GOOD BAD RANGE 0x28000 0x79d0 0x0000 0x0 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28001 0xd079 0x0000 0x1 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28002 0x7914 0x0000 0x2 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28003 0x1479 0x0000 0x3 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28004 0x79ec 0x0000 0x4 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28005 0xec79 0x0000 0x5 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28006 0x7929 0x0000 0x6 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28007 0x2979 0x0000 0x7 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28008 0x7935 0x0000 0x8 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x28009 0x3579 0x0000 0x9 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x2800a 0x7968 0x0000 0xa operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x2800b 0x6879 0x0000 0xb operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x2800c 0x79d3 0x0000 0xc operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x2800d 0xd379 0x0000 0xd operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x2800e 0x79f2 0x0000 0xe operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops 0x2800f 0xf279 0x0000 0xf operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops LOG DUMP (633 total operations): 1( 1 mod 256): SKIPPED (no operation) 2( 2 mod 256): TRUNCATE UP from 0x0 to 0x3aea7 ******WWWW 3( 3 mod 256): COPY 0x1a3d6 thru 0x26607 (0xc232 bytes) to 0x2ea8c thru 0x3acbd 4( 4 mod 256): READ 0x2f6d2 thru 0x3853e (0x8e6d bytes) 5( 5 mod 256): READ 0x2d6d3 thru 0x310f5 (0x3a23 bytes) ***RRRR*** 6( 6 mod 256): SKIPPED (no operation) 7( 7 mod 256): WRITE 0x341a2 thru 0x3fafc (0xb95b bytes) EXTEND ... 625(113 mod 256): PUNCH 0x8f01 thru 0x9806 (0x906 bytes) 626(114 mod 256): MAPREAD 0xe517 thru 0x11396 (0x2e80 bytes) 627(115 mod 256): SKIPPED (no operation) 628(116 mod 256): SKIPPED (no operation) 629(117 mod 256): FALLOC 0x284fe thru 0x32480 (0x9f82 bytes) EXTENDING ******FFFF 630(118 mod 256): WRITE 0x333ac thru 0x3364b (0x2a0 bytes) HOLE 631(119 mod 256): SKIPPED (no operation) 632(120 mod 256): SKIPPED (no operation) 633(121 mod 256): WRITE 0x1f876 thru 0x2d86a (0xdff5 bytes) ***WWWW ``` Daniel Gomez (9): XArray: add cmpxchg order test shmem: drop BLOCKS_PER_PAGE macro shmem: return number of pages beeing freed in shmem_free_swap shmem: trace shmem_add_to_page_cache folio order shmem: remove huge arg from shmem_alloc_and_add_folio() shmem: add file length arg in shmem_get_folio() path shmem: add order arg to shmem_alloc_folio() shmem: add large folio support to the write path shmem: add per-block uptodate tracking Luis Chamberlain (2): test_xarray: add tests for advanced multi-index use shmem: account for large order folios MAINTAINERS | 1 + include/linux/shmem_fs.h | 2 +- include/trace/events/shmem.h | 52 ++++++ lib/test_xarray.c | 155 +++++++++++++++++ mm/khugepaged.c | 3 +- mm/shmem.c | 325 ++++++++++++++++++++++++++++------- mm/userfaultfd.c | 2 +- 7 files changed, 472 insertions(+), 68 deletions(-) create mode 100644 include/trace/events/shmem.h -- 2.39.2