It brings thp support for ramfs, but without mmap() -- that will be posted
separately.

Please review and consider applying.

Intro
-----

The goal of the project is to prepare kernel infrastructure for handling huge
pages in the page cache.

To prove that the proposed changes are functional, we enable the feature for
the simplest file system -- ramfs. ramfs is not that useful by itself, but
it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries. All entries point to the head page -- refcounting
tail pages would be pretty expensive.

Radix-tree manipulations are done in a batched way: we add and remove a whole
huge page at once, under one tree_lock. To make this possible, we extended the
radix-tree interface to pre-allocate enough memory to insert a number of
*contiguous* elements (kudos to Matthew Wilcox). A simplified sketch of this
layout is given at the end of this section.

Huge pages can be added to the page cache in three ways:
 - write(2) to a file or page;
 - read(2) from a sparse file;
 - page fault in a sparse file.

Potentially, one more way is collapsing small pages, but that is outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speed-up later.

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files, we avoid write-allocation in the
first huge page area (2M on x86-64) of the file.

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out that
part, exactly as we do for partial small pages.

split_huge_page() for file pages works similarly to anon pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

inode->i_split_sem taken for read protects huge pages in the inode's page
cache against splitting; we take it for write during splitting.
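For illustration only, here is a minimal sketch of the radix-tree layout
described above: all HPAGE_PMD_NR slots for one huge page are filled with the
head page under a single tree_lock, after preloading for contiguous entries.
This is not the code from the patches; the function name
hugepage_add_to_cache() and the contiguous-preload helper name
radix_tree_maybe_preload_contig() are assumptions made for the example -- see
the radix-tree patch in the series for the real interface, error unwinding and
accounting.

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

/*
 * Sketch only (not the patch code): link one huge page into the page
 * cache.  All HPAGE_PMD_NR slots point to the head page and the whole
 * range is inserted under a single tree_lock.  Unwinding of partially
 * inserted slots and page cache accounting are omitted.
 */
static int hugepage_add_to_cache(struct address_space *mapping,
                                 struct page *head, pgoff_t index)
{
        int i, err;

        /*
         * Assumed name for the extended preload interface: reserve
         * radix-tree nodes for HPAGE_PMD_NR *contiguous* slots before
         * taking tree_lock.
         */
        err = radix_tree_maybe_preload_contig(HPAGE_PMD_NR, GFP_KERNEL);
        if (err)
                return err;

        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                /* every index maps to the head page; tails take no refcount */
                err = radix_tree_insert(&mapping->page_tree, index + i, head);
                if (err)
                        break;  /* the real code must back out slots 0..i-1 */
        }
        spin_unlock_irq(&mapping->tree_lock);
        radix_tree_preload_end();

        return err;
}

The idea is that a lookup at any of the 512 indexes returns the head page, so
tail pages never need page cache references of their own.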
Changes since v5
----------------

 - change how a huge page is stored in the page cache: the head page is stored
   for all relevant indexes;
 - introduce i_split_sem;
 - do not create huge pages on write(2) into the first huge page area;
 - compile-disabled by default;
 - fix transparent_hugepage_pagecache();

Benchmarks
----------

Since the patchset doesn't include mmap() support, we shouldn't expect much
change in performance. We just need to check that we don't introduce any major
regression.

On average, read/write on ramfs with thp is a bit slower, but I don't think
it's a stopper -- ramfs is a toy anyway, and on real-world filesystems I
expect the difference to be smaller.

postmark
========

workload1:

chmod +x postmark
mount -t ramfs none /mnt
cat >/root/workload1 <<EOF
set transactions 250000
set size 5120 524288
set number 500
run
quit
EOF

workload2:

set transactions 10000
set size 2097152 10485760
set number 100
run
quit

throughput (transactions/sec):

           workload1  workload2
baseline        8333        416
patched         8333        454

FS-Mark
=======

throughput (files/sec):

           2000 files by 1M  200 files by 10M
baseline             5326.1             548.1
patched              5192.8             528.4

tiobench
========

baseline:

Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.2 s | 8667.792 MB/s |  445.2 % | 5535.9 % |
| Random Write   62 MBs |    0.0 s | 8341.118 MB/s |    0.0 % | 2615.8 % |
| Read         2048 MBs |    0.2 s | 11680.431 MB/s |  339.9 % | 5470.6 % |
| Random Read    62 MBs |    0.0 s | 9451.081 MB/s |  786.3 % | 1451.7 % |
`----------------------------------------------------------------------'

Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.006 ms |       28.019 ms |  0.00000 |   0.00000 |
| Random Write |        0.002 ms |        5.574 ms |  0.00000 |   0.00000 |
| Read         |        0.005 ms |       28.018 ms |  0.00000 |   0.00000 |
| Random Read  |        0.002 ms |        4.852 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.019 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

patched:

Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.3 s | 7942.818 MB/s |  442.1 % | 5533.6 % |
| Random Write   62 MBs |    0.0 s | 9425.426 MB/s |  723.9 % |  965.2 % |
| Read         2048 MBs |    0.2 s | 11998.008 MB/s |  374.9 % | 5485.8 % |
| Random Read    62 MBs |    0.0 s | 9823.955 MB/s |  251.5 % | 2011.9 % |
`----------------------------------------------------------------------'

Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.007 ms |       28.020 ms |  0.00000 |   0.00000 |
| Random Write |        0.001 ms |        0.022 ms |  0.00000 |   0.00000 |
| Read         |        0.004 ms |       24.011 ms |  0.00000 |   0.00000 |
| Random Read  |        0.001 ms |        0.019 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.020 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

IOZone
======

Syscalls, not mmap.
** Initial writers **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  4741691 7986408 9149064 9898695 9868597 9629383 9469202 11605064 9507802 10641869 11360701 11040376
patched:   4682864 7275535 8691034 8872887 8712492 8771912 8397216 7701346 7366853 8839736 8299893 10788439
speed-up(times): 0.99 0.91 0.95 0.90 0.88 0.91 0.89 0.66 0.77 0.83 0.73 0.98

** Rewriters **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  5807891 9554869 12101083 13113533 12989751 14359910 16998236 16833861 24735659 17502634 17396706 20448655
patched:   6161690 9981294 12285789 13428846 13610058 13669153 20060182 17328347 24109999 19247934 24225103 34686574
speed-up(times): 1.06 1.04 1.02 1.02 1.05 0.95 1.18 1.03 0.97 1.10 1.39 1.70

** Readers **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  7978066 11825735 13808941 14049598 14765175 14422642 17322681 23209831 21386483 20060744 22032935 31166663
patched:   7723293 11481500 13796383 14363808 14353966 14979865 17648225 18701258 29192810 23973723 22163317 23104638
speed-up(times): 0.97 0.97 1.00 1.02 0.97 1.04 1.02 0.81 1.37 1.20 1.01 0.74

** Re-readers **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  7966269 11878323 14000782 14678206 14154235 14271991 15170829 20924052 27393344 19114990 12509316 18495597
patched:   7719350 11410937 13710233 13232756 14040928 15895021 16279330 17256068 26023572 18364678 27834483 23288680
speed-up(times): 0.97 0.96 0.98 0.90 0.99 1.11 1.07 0.82 0.95 0.96 2.23 1.26

** Reverse readers **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  6630795 10331013 12839501 13157433 12783323 13580283 15753068 15434572 21928982 17636994 14737489 19470679
patched:   6502341 9887711 12639278 12979232 13212825 12928255 13961195 14695786 21370667 19873807 20902582 21892899
speed-up(times): 0.98 0.96 0.98 0.99 1.03 0.95 0.89 0.95 0.97 1.13 1.42 1.12

** Random_readers **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  5152935 9043813 11752615 11996078 12283579 12484039 14588004 15781507 23847538 15748906 13698335 27195847
patched:   5009089 8438137 11266015 11631218 12093650 12779308 17768691 13640378 30468890 19269033 23444358 22775908
speed-up(times): 0.97 0.93 0.96 0.97 0.98 1.02 1.22 0.86 1.28 1.22 1.71 0.84

** Random_writers **
threads:   1 2 4 8 10 20 30 40 50 60 70 80
baseline:  3886268 7405345 10531192 10858984 10994693 12758450 10729531 9656825 10370144 13139452 4528331 12615812
patched:   4335323 7916132 10978892 11423247 11790932 11424525 11798171 11413452 12230616 13075887 11165314 16925679
speed-up(times): 1.12 1.07 1.04 1.05 1.07 0.90 1.10 1.18 1.18 1.00 2.47 1.34
Kirill A. Shutemov (22):
  mm: implement zero_huge_user_segment and friends
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: add event counters for huge page alloc on file write or read
  mm, vfs: introduce i_split_sem
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 ++++
 drivers/base/node.c            |   4 +
 fs/inode.c                     |   3 +
 fs/libfs.c                     |  58 +++++++++++-
 fs/proc/meminfo.c              |   3 +
 fs/ramfs/file-mmu.c            |   2 +-
 fs/ramfs/inode.c               |   6 +-
 include/linux/backing-dev.h    |  10 +++
 include/linux/fs.h             |  11 +++
 include/linux/huge_mm.h        |  68 +++++++++++++-
 include/linux/mm.h             |  18 ++++
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |  13 +++
 include/linux/pagemap.h        |  31 +++++++
 include/linux/radix-tree.h     |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c               |  94 ++++++++++++++++++--
 mm/Kconfig                     |  11 +++
 mm/filemap.c                   | 196 ++++++++++++++++++++++++++++++++---------
 mm/huge_memory.c               | 147 +++++++++++++++++++++++++----
 mm/memcontrol.c                |   3 +-
 mm/memory.c                    |  40 ++++++++-
 mm/truncate.c                  | 125 ++++++++++++++++++++------
 mm/vmstat.c                    |   5 ++
 25 files changed, 779 insertions(+), 108 deletions(-)

--
1.8.4.rc3