On Wed, May 08, 2024 at 07:03:57PM +0200, David Hildenbrand wrote:
> On 08.05.24 16:28, Daniel Gomez wrote:
> > On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
> > > On 08.05.24 13:39, Daniel Gomez wrote:
> > > > On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
> > > > > Anonymous pages have already been supported for multi-size (mTHP) allocation
> > > > > through commit 19eaf44954df, that can allow THP to be configured through the
> > > > > sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> > > > >
> > > > > However, the anonymous shared pages will ignore the anonymous mTHP rule
> > > > > configured through the sysfs interface, and can only use the PMD-mapped
> > > > > THP, that is not reasonable. Many implement anonymous page sharing through
> > > > > mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage scenarios,
> > > > > therefore, users expect to apply an unified mTHP strategy for anonymous pages,
> > > > > also including the anonymous shared pages, in order to enjoy the benefits of
> > > > > mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat
> > > > > than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc.
> > > > >
> > > > > The primary strategy is similar to supporting anonymous mTHP. Introduce
> > > > > a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> > > > > which can have all the same values as the top-level
> > > > > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> > > > > additional "inherit" option. By default all sizes will be set to "never"
> > > > > except PMD size, which is set to "inherit". This ensures backward compatibility
> > > > > with the shmem enabled of the top level, meanwhile also allows independent
> > > > > control of shmem enabled for each mTHP.
> > > >
> > > > I'm trying to understand the adoption of mTHP and how it fits into the adoption
> > > > of (large) folios that the kernel is moving towards. Can you, or anyone involved
> > > > here, explain this? How much do they overlap, and can we benefit from having
> > > > both? Is there any argument against the adoption of large folios here that I
> > > > might have missed?
> > >
> > > mTHP are implemented using large folios, just like traditional PMD-sized THP
> > > are. (you really should explore the history of mTHP and how it all works
> > > internally)
> >
> > I'll check more in deep the code. By any chance are any of you going to be at
> > LSFMM this year? I have this session [1] scheduled for Wednesday and it would
> > be nice to get your feedback on it and if you see this working together with
> > mTHP/THP.
> >
>
> I'll be around and will attend that session! But note that I am still
> scratching my head what to do with "ordinary" shmem, especially because of
> the weird way shmem behaves in contrast to real files (below). Some input
> from Hugh might be very helpful.

I'm looking forward to meeting you there and getting your feedback!

>
> Example: you write() to a shmem file and populate a 2M THP. Then, nobody
> touches that file for a long time. There are certainly other mmap() users
> that could better benefit from that THP ... and without swap that THP will
> be trapped there possibly a long time (unless I am missing an important
> piece of shmem THP design :) )? Sure, if we only have THP's it's nice,
> that's just not the reality unfortunately. IIRC, that's one of the reasons
> why THP for shmem can be enabled/disabled.
> But again, still scratching my head ...
>
>
> Note that this patch set only tackles anonymous shmem (MAP_SHARED|MAP_ANON),
> which is in 99.999% of all cases only accessed via page tables (memory
> allocated during page faults). I think there are ways to grab the fd
> (/proc/self/fd), but IIRC only corner cases read/write that.
>
> So in that sense, anonymous shmem (this patch set) behaves mostly like
> ordinary anonymous memory, and likely there is not much overlap with other
> "allocate large folios during read/write/fallocate" as in [1]. swap might
> have an overlap.
>
>
> The real confusion begins when we have ordinary shmem: some users never mmap
> it and only read/write, some users never read/write it and only mmap it and
> some (less common?) users do both.
>
> And shmem really is special: it looks like "just another file", but
> memory-consumption and reclaim wise it behaves just like anonymous memory.
> It might be swappable ("usually very limited backing disk space available")
> or it might not.
>
> In a subthread here we are discussing what to do with that special
> "shmem_enabled = force" mode ... and it's all complicated I think.
>
>
> [1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/
>
> >
> > > The biggest challenge with memory that cannot be evicted on memory pressure
> > > to be reclaimed (in contrast to your ordinary files in the pagecache) is
> > > memory waste, well, and placement of large chunks of memory in general,
> > > during page faults.
> > >
> > > In the worst case (no swap), you allocate a large chunk of memory once and
> > > it will stick around until freed: no reclaim of that memory.
> >
> > I can see that path being triggered by some fstests but only for THP (where we
> > can actually reclaim memory).
>
> Is that when we punch-hole a partial THP and split it? I'd be interested in
> what that test does.

The reclaim path I'm referring to is triggered when we reach max capacity
(-ENOSPC) in shmem_alloc_and_add_folio(). We reclaim space by splitting large
folios (regardless of their dirty or uptodate condition). One of the tests
that hits this path is generic/100 (with the huge option enabled):

- First, it creates a directory structure in $TEMP_DIR (/tmp). The dir size
  is around 26M.
- Then, it tars it up into $TEMP_DIR/temp.tar.
- Finally, it untars the archive into $TEST_DIR (/media/test, which is the
  huge tmpfs mount dir).

What happens in generic/100 under the huge=always case is that you fill up
the dedicated space very quickly (this is 1G in xfstests for tmpfs) and then
you start reclaiming.

>
> --
> Cheers,
>
> David / dhildenb
>
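
P.S. In case it helps to picture the generic/100 scenario above, here is a
rough, untested shell sketch of what the test effectively does. The mount
point, the source tree and the explicit 1G size below are illustrative
stand-ins only; the real test uses the TEST_DIR/TEMP_DIR configured for
xfstests.

    # Illustrative stand-ins for the xfstests TEST_DIR/TEMP_DIR setup.
    mkdir -p /mnt/huge-tmpfs
    mount -t tmpfs -o size=1G,huge=always tmpfs /mnt/huge-tmpfs

    # Build a directory tree of mostly small files (~26M in the real test)
    # and tar it up, like the test does in $TEMP_DIR.
    mkdir -p /tmp/src
    cp -r /usr/share/doc/. /tmp/src/ 2>/dev/null
    tar -cf /tmp/temp.tar -C /tmp src

    # Untar into the huge tmpfs. With huge=always each file is initially
    # backed by a huge allocation, so even this modest tree fills the 1G
    # very quickly; once shmem_alloc_and_add_folio() hits -ENOSPC, space
    # is reclaimed by splitting the large folios already in the file system.
    tar -xf /tmp/temp.tar -C /mnt/huge-tmpfs

    # The split counters in /proc/vmstat (e.g. thp_split_page) should move.
    grep split /proc/vmstat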