On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to mmap
> > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > page cache, we don't even have to read them all unless we come from PMD
> > > fault path.
> >
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> >
> > PG_dirty and PG_uptodate behaviour inhered from anon-THP (where handling
> > it otherwise doesn't make sense) and handling it differently for file-THP
> > is nightmare from maintenance POV.
>
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty
of memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, the buffer_head list doesn't scale.
That's the main reason (along with 4) why syscall-based IO sucks. We spend
a lot of time looking for the desired block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have a list there in the first place?
Why not just an array?

I will look into it, but this sounds like a separate infrastructure change
project.

> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt holes may change with mmap()-IO as we have less
granularity, but the same can be seen just between different
architectures: 4k vs. 64k base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this in profiles. And xfstests look fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall-IO performance and take
advantage of huge pages.

But again, it's an infrastructure change that would likely affect the
interface between VFS and filesystems. It deserves a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively
small, so it fits into page cache and the writing set doesn't hit
dirty_ratio.
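
To put rough numbers on the buffer_head point (1) discussed earlier, here is a
small user-space model. It is a sketch, not kernel code. It assumes a 2M
PMD-sized page (as on x86-64), and the helper names `buffers_per_page`,
`list_steps`, `array_steps` and `average_list_steps` are made up for this
illustration. It counts how many nodes a lookup visits when the per-page
buffers sit on a singly linked list versus in an array.

```python
HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: 2M PMD-sized page, as on x86-64


def buffers_per_page(block_size, page_size=HUGE_PAGE_SIZE):
    """Number of block-sized buffers needed to cover one page."""
    return page_size // block_size


def list_steps(index):
    """Nodes visited to reach buffer `index` when walking a list from its head."""
    return index + 1


def array_steps(index):
    """An array lookup touches one slot, whatever the index is."""
    return 1


def average_list_steps(count):
    """Mean number of nodes visited over all positions in a list of `count` buffers."""
    return sum(list_steps(i) for i in range(count)) / count


if __name__ == "__main__":
    for block_size in (4096, 1024):
        count = buffers_per_page(block_size)
        print(
            f"block size {block_size}: {count} buffers per page, "
            f"average list walk {average_list_steps(count)} nodes, "
            f"array lookup {array_steps(count - 1)} slot"
        )
```

With 4k blocks this gives the 512 buffers mentioned in the quoted text, and
with 1k blocks the 2048. The average list walk grows linearly with that count,
while the array lookup stays constant, which is the gap the "why not just an
array?" question is about.
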

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it is useful for workloads that target
mmap()-IO. It will take time for the feature to mature anyway.

Configuration:
 - 2x E5-2697v2, 64G RAM;
 - INTEL SSDSC2CW24;
 - IO request size is 4k;
 - 8 processes, 512MB data set each;

Workload         read/write     baseline   stddev huge=always   stddev    change
--------------------------------------------------------------------------------
sync-read        read           21439.00   348.14    20297.33   259.62    -5.33%
sync-write       write           6833.20   147.08     3630.13    52.86   -46.88%
sync-readwrite   read            4377.17    17.53     2366.33    19.52   -45.94%
                 write           4378.50    17.83     2365.80    19.94   -45.97%
sync-randread    read            5491.20    66.66    14664.00   288.29   167.05%
sync-randwrite   write           6396.13    98.79     2035.80     8.17   -68.17%
sync-randrw      read            2927.30   115.81     1036.08    34.67   -64.61%
                 write           2926.47   116.45     1036.11    34.90   -64.60%
libaio-read      read             254.36    12.49      258.63    11.29     1.68%
libaio-write     write           4979.20   122.75     2904.77    17.93   -41.66%
libaio-readwrite read            2738.57   142.72     2045.80     4.12   -25.30%
                 write           2729.93   141.80     2039.77     3.79   -25.28%
libaio-randread  read             113.63     2.98      210.63     5.07    85.37%
libaio-randwrite write           4456.10    76.21     1649.63     7.00   -62.98%
libaio-randrw    read              97.85     8.03      877.49    28.27   796.80%
                 write             97.55     7.99      874.83    28.19   796.77%
mmap-read        read           20654.67   304.48    24696.33  1064.07    19.57%
mmap-write       write           8652.33   272.44    13187.33   499.10    52.41%
mmap-readwrite   read            6620.57    16.05     9221.60   399.56    39.29%
                 write           6623.63    16.34     9222.13   399.31    39.23%
mmap-randread    read            6717.23  1360.55    21939.33   326.38   226.61%
mmap-randwrite   write           3204.63   253.66    12371.00    61.49   286.03%
mmap-randrw      read            2150.50    78.00     7682.67   188.59   257.25%
                 write           2149.50    78.00     7685.40   188.35   257.54%

-- 
 Kirill A. Shutemov
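
For anyone who wants to sanity-check the "change" column, the sketch below
recomputes it for a few rows copied from the table. It assumes the column is
the relative difference of huge=always against baseline, that is
(huge - baseline) / baseline. That formula is an assumption, not something
stated in the mail, although it matches the reported values closely. The
function name `relative_change` is made up for this illustration.

```python
# (workload, baseline, huge=always, reported change in %), copied from the table.
ROWS = [
    ("sync-read", 21439.00, 20297.33, -5.33),
    ("sync-write", 6833.20, 3630.13, -46.88),
    ("sync-randread", 5491.20, 14664.00, 167.05),
    ("mmap-write", 8652.33, 13187.33, 52.41),
    ("mmap-randwrite", 3204.63, 12371.00, 286.03),
]


def relative_change(baseline, huge):
    """Percentage change of the huge=always result relative to baseline."""
    return (huge - baseline) / baseline * 100.0


if __name__ == "__main__":
    for name, baseline, huge, reported in ROWS:
        recomputed = relative_change(baseline, huge)
        print(f"{name:16s} recomputed {recomputed:8.2f}%  reported {reported:8.2f}%")
```

The recomputed values can differ from the reported ones in the last digit,
presumably because the table holds rounded averages.
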