On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to mmap
> > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > page cache, we don't even have to read them all unless we come from PMD
> > > fault path.
> >
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> >
> > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where
> > handling it otherwise doesn't make sense) and handling it differently
> > for file-THP is a nightmare from a maintenance POV.
>
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty
of memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, the buffer_head list doesn't scale. That's the main reason (along
with 4) why syscall-based IO sucks: we spend a lot of time looking for the
desired block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have a list there in the first place? Why not
just an array?

I will look into it, but this sounds like a separate infrastructure change
project.
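To make the difference concrete, here is a rough sketch (not code from the
patchset, just an illustration): the first helper mirrors the b_this_page
walk we effectively do today, the second shows what an array-indexed
lookup could look like. page_bh_array() doesn't exist; it's a placeholder
for whatever the new structure would be.

#include <linux/buffer_head.h>

/*
 * Roughly what we do today: finding the buffer_head for block 'block'
 * within a page means walking the circular b_this_page list -- O(blocks
 * per page), i.e. up to 2048 steps for a 2M page with 1k blocks, on
 * every lookup.
 */
static struct buffer_head *page_lookup_bh_list(struct page *page,
					       unsigned int block)
{
	struct buffer_head *head = page_buffers(page);
	struct buffer_head *bh = head;
	unsigned int i = 0;

	do {
		if (i++ == block)
			return bh;
		bh = bh->b_this_page;
	} while (bh != head);

	return NULL;
}

/*
 * The idea: with an array of buffer_head pointers attached to the page,
 * the lookup becomes O(1). page_bh_array() is made up for illustration
 * only.
 */
static struct buffer_head *page_lookup_bh_array(struct page *page,
						unsigned int block)
{
	struct buffer_head **bhs = page_bh_array(page);	/* hypothetical */

	return bhs[block];
}

Where such an array would live (today page->private points at the first
buffer_head of the list) is exactly the kind of question that makes this a
separate series.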
> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt holes may change with mmap()-IO as we have less granularity,
but the same can be seen just between different architectures: 4k vs. 64k
base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this in profiles. And xfstests looks fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall-IO performance and take
advantage of huge pages. But again, it's an infrastructure change that
would likely affect the interface between VFS and filesystems. It deserves
a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively
small, so it fits into the page cache and the writes don't hit
dirty_ratio.

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it is useful for workloads that target
mmap()-IO. It will take time to get the feature mature anyway.

Configuration:
 - 2x E5-2697v2, 64G RAM;
 - INTEL SSDSC2CW24;
 - IO request size is 4k;
 - 8 processes, 512MB data set each;

Workload
  read/write      baseline    stddev   huge=always    stddev    change
--------------------------------------------------------------------------------
sync-read
  read            21439.00    348.14      20297.33    259.62    -5.33%
sync-write
  write            6833.20    147.08       3630.13     52.86   -46.88%
sync-readwrite
  read             4377.17     17.53       2366.33     19.52   -45.94%
  write            4378.50     17.83       2365.80     19.94   -45.97%
sync-randread
  read             5491.20     66.66      14664.00    288.29   167.05%
sync-randwrite
  write            6396.13     98.79       2035.80      8.17   -68.17%
sync-randrw
  read             2927.30    115.81       1036.08     34.67   -64.61%
  write            2926.47    116.45       1036.11     34.90   -64.60%
libaio-read
  read              254.36     12.49        258.63     11.29     1.68%
libaio-write
  write            4979.20    122.75       2904.77     17.93   -41.66%
libaio-readwrite
  read             2738.57    142.72       2045.80      4.12   -25.30%
  write            2729.93    141.80       2039.77      3.79   -25.28%
libaio-randread
  read              113.63      2.98        210.63      5.07    85.37%
libaio-randwrite
  write            4456.10     76.21       1649.63      7.00   -62.98%
libaio-randrw
  read               97.85      8.03        877.49     28.27   796.80%
  write              97.55      7.99        874.83     28.19   796.77%
mmap-read
  read            20654.67    304.48      24696.33   1064.07    19.57%
mmap-write
  write            8652.33    272.44      13187.33    499.10    52.41%
mmap-readwrite
  read             6620.57     16.05       9221.60    399.56    39.29%
  write            6623.63     16.34       9222.13    399.31    39.23%
mmap-randread
  read             6717.23   1360.55      21939.33    326.38   226.61%
mmap-randwrite
  write            3204.63    253.66      12371.00     61.49   286.03%
mmap-randrw
  read             2150.50     78.00       7682.67    188.59   257.25%
  write            2149.50     78.00       7685.40    188.35   257.54%

--
 Kirill A. Shutemov