Re: [PATCH, RFC 00/16] Transparent huge page cache

Wanpeng Li <liwanp@xxxxxxxxxxxxxxxxxx> · Sun, 7 Apr 2013 08:26:19 +0800

On Fri, Apr 05, 2013 at 09:42:08AM +0800, Wanpeng Li wrote:
>On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote:
>>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>>> Hugh Dickins wrote:
>>> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>>> > > From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
>>> > > 
>>> > > Here's first steps towards huge pages in page cache.
>>> > > 
>>> > > The intend of the work is get code ready to enable transparent huge page
>>> > > cache for the most simple fs -- ramfs.
>>> > > 
>>> > > It's not yet near feature-complete. It only provides basic infrastructure.
>>> > > At the moment we can read, write and truncate file on ramfs with huge pages in
>>> > > page cache. The most interesting part, mmap(), is not yet there. For now
>>> > > we split huge page on mmap() attempt.
>>> > > 
>>> > > I can't say that I see whole picture. I'm not sure if I understand locking
>>> > > model around split_huge_page(). Probably, not.
>>> > > Andrea, could you check if it looks correct?
>>> > > 
>>> > > Next steps (not necessary in this order):
>>> > >  - mmap();
>>> > >  - migration (?);
>>> > >  - collapse;
>>> > >  - stats, knobs, etc.;
>>> > >  - tmpfs/shmem enabling;
>>> > >  - ...
>>> > > 
>>> > > Kirill A. Shutemov (16):
>>> > >   block: implement add_bdi_stat()
>>> > >   mm: implement zero_huge_user_segment and friends
>>> > >   mm: drop actor argument of do_generic_file_read()
>>> > >   radix-tree: implement preload for multiple contiguous elements
>>> > >   thp, mm: basic defines for transparent huge page cache
>>> > >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>> > >   thp, mm: rewrite delete_from_page_cache() to support huge pages
>>> > >   thp, mm: locking tail page is a bug
>>> > >   thp, mm: handle tail pages in page_cache_get_speculative()
>>> > >   thp, mm: implement grab_cache_huge_page_write_begin()
>>> > >   thp, mm: naive support of thp in generic read/write routines
>>> > >   thp, libfs: initial support of thp in
>>> > >     simple_read/write_begin/write_end
>>> > >   thp: handle file pages in split_huge_page()
>>> > >   thp, mm: truncate support for transparent huge page cache
>>> > >   thp, mm: split huge page on mmap file page
>>> > >   ramfs: enable transparent huge page cache
>>> > > 
>>> > >  fs/libfs.c                  |   54 +++++++++---
>>> > >  fs/ramfs/inode.c            |    6 +-
>>> > >  include/linux/backing-dev.h |   10 +++
>>> > >  include/linux/huge_mm.h     |    8 ++
>>> > >  include/linux/mm.h          |   15 ++++
>>> > >  include/linux/pagemap.h     |   14 ++-
>>> > >  include/linux/radix-tree.h  |    3 +
>>> > >  lib/radix-tree.c            |   32 +++++--
>>> > >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>> > >  mm/huge_memory.c            |   62 +++++++++++--
>>> > >  mm/memory.c                 |   22 +++++
>>> > >  mm/truncate.c               |   12 +++
>>> > >  12 files changed, 375 insertions(+), 67 deletions(-)
>>> > 
>>> > Interesting.
>>> > 
>>> > I was starting to think about Transparent Huge Pagecache a few
>>> > months ago, but then got washed away by incoming waves as usual.
>>> > 
>>> > Certainly I don't have a line of code to show for it; but my first
>>> > impression of your patches is that we have very different ideas of
>>> > where to start.
>>
>>A second impression confirms that we have very different ideas of
>>where to start.  I don't want to be dismissive, and please don't let
>>me discourage you, but I just don't find what you have very interesting.
>>
>>I'm sure you'll agree that the interesting part, and the difficult part,
>>comes with mmap(); and there's no point whatever to THPages without mmap()
>>(of course, I'm including exec and brk and shm when I say mmap there).
>>
>>(There may be performance benefits in working with larger page cache
>>size, which Christoph Lameter explored a few years back, but that's a
>>different topic: I think 2MB - if I may be x86_64-centric - would not be
>>the unit of choice for that, unless SSD erase block were to dominate.)
>>
>>I'm interested to get to the point of prototyping something that does
>>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
>>a lot about my misconceptions, and have to rework for a while (or give
>>up!); but I don't see much point in posting anything without that.
>>I don't know if we have 5 or 50 places which "know" that a THPage
>>must be Anon: some I'll spot in advance, some I sadly won't.
>>
>>It's not clear to me that the infrastructural changes you make in this
>>series will be needed or not, if I pursue my approach: some perhaps as
>>optimizations on top of the poorly performing base that may emerge from
>>going about it my way.  But for me it's too soon to think about those.
>>
>>Something I notice that we do agree upon: the radix_tree holding the
>>4k subpages, at least for now.  When I first started thinking towards
>>THPageCache, I was fascinated by how we could manage the hugepages in
>>the radix_tree, cutting out unnecessary levels etc; but after a while
>>I realized that although there's probably nice scope for cleverness
>>there (significantly constrained by RCU expectations), it would only
>>be about optimization.  Let's be simple and stupid about radix_tree
>>for now, the problems that need to be worked out lie elsewhere.
>>
>>> > 
>>> > Perhaps that's good complementarity, or perhaps I'll disagree with
>>> > your approach.  I'll be taking a look at yours in the coming days,
>>> > and trying to summon back up my own ideas to summarize them for you.
>>> 
>>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>> 
>>> > Perhaps I was naive to imagine it, but I did intend to start out
>>> > generically, independent of filesystem; but content to narrow down
>>> > on tmpfs alone where it gets hard to support the others (writeback
>>> > springs to mind).  khugepaged would be migrating little pages into
>>> > huge pages, where it saw that the mmaps of the file would benefit
>>> > (and for testing I would hack mmap alignment choice to favour it).
>>> 
>>> I don't think all fs at once would fly, but it's wonderful, if I'm
>>> wrong :)
>>
>>You are imagining the filesystem putting huge pages into its cache.
>>Whereas I'm imagining khugepaged looking around at mmaped file areas,
>>seeing which would benefit from huge pagecache (let's assume offset 0
>>belongs on hugepage boundary - maybe one day someone will want to tune
>>some files or parts differently, but that's low priority), migrating 4k
>>pages over to 2MB page (wouldn't have to be done all in one pass), then
>>finally slotting in the pmds for that.
>>
>>But going this way, I expect we'd have to split at page_mkwrite():
>>we probably don't want a single touch to dirty 2MB at a time,
>>unless tmpfs or ramfs.
>>
>>> 
>>> > I had arrived at a conviction that the first thing to change was
>>> > the way that tail pages of a THP are refcounted, that it had been a
>>> > mistake to use the compound page method of holding the THP together.
>>> > But I'll have to enter a trance now to recall the arguments ;)
>>> 
>>> THP refcounting looks reasonable for me, if take split_huge_page() in
>>> account.
>>
>>I'm not claiming that the THP refcounting is wrong in what it's doing
>>at present; but that I suspect we'll want to rework it for THPageCache.
>>
>>Something I take for granted, I think you do too but I'm not certain:
>>a file with transparent huge pages in its page cache can also have small
>>pages in other extents of its page cache; and can be mapped hugely (2MB
>>extents) into one address space at the same time as individual 4k pages
>>from those extents are mapped into another (or the same) address space.
>>
>>One can certainly imagine sacrificing that principle, splitting whenever
>>there's such a "conflict"; but it then becomes uninteresting to me, too
>>much like hugetlbfs.  Splitting an anonymous hugepage in all address
>>spaces that hold it when one of them needs it split, that has been a
>>pragmatic strategy: it's not a common case for forks to diverge like
>>that; but files are expected to be more widely shared.
>>
>>At present THP is using compound pages, with mapcount of tail pages
>>reused to track their contribution to head page count; but I think we
>>shall want to be able to use the mapcount, and the count, of TH tail
>>pages for their original purpose if huge mappings can coexist with tiny.
>>Not fully thought out, but that's my feeling.
>>
>>The use of compound pages, in particular the redirection of tail page
>>count to head page count, was important in hugetlbfs: a get_user_pages
>>reference on a subpage must prevent the containing hugepage from being
>>freed, because hugetlbfs has its own separate pool of hugepages to
>>which freeing returns them.
>>
>>But for transparent huge pages?  It should not matter so much if the
>>subpages are freed independently.  So I'd like to devise another glue
>>to hold them together more loosely (for prototyping I can certainly
>>pretend we have infinite pageflag and pagefield space if that helps):
>>I may find in practice that they're forever falling apart, and I run
>>crying back to compound pages; but at present I'm hoping not.
>>
>>This mail might suggest that I'm about to start coding: I wish that
>>were true, but in reality there's always a lot of unrelated things
>>I have to look at, which dilute my focus.  So if I've said anything
>>that sparks ideas for you, go with them.

Hi Hugh,

commit 70b50f94f16 ("mm: thp: tail page refcounting fix") tells us account 
the tail page references on tail_page->_count wasn't safe.

Regards,
Wanpeng Li 

>
>It seems that it's a good idea, Hugh. I will start coding this. ;-)
>
>Regards,
>Wanpeng Li 
>
>>
>>Hugh
>>
>>--
>>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
>>see: http://www.linux-mm.org/ .
>>Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>