Re: THP backed thread stacks

Peter Xu <peterx@xxxxxxxxxx> · Mon, 6 Mar 2023 19:15:29 -0500

On Mon, Mar 06, 2023 at 03:57:30PM -0800, Mike Kravetz wrote:
> One of our product teams recently experienced 'memory bloat' in their
> environment.  The application in this environment is the JVM which
> creates hundreds of threads.  Threads are ultimately created via
> pthread_create which also creates the thread stacks.  pthread attributes
> are modified so that stacks are 2MB in size.  It just so happens that
> due to allocation patterns, all their stacks are at 2MB boundaries.  The
> system has THP always set, so a huge page is allocated at the first
> (write) fault when libpthread initializes the stack.
> 
> It would seem that this is expected behavior.  If you set THP always,
> you may get huge pages anywhere.
> 
> However, I can't help but think that backing stacks with huge pages by
> default may not be the right thing to do.  Stacks by their very nature
> grow in somewhat unpredictable ways over time.  Using a large virtual
> space so that memory is allocated as needed is the desired behavior.
> 
> The only way to address their 'memory bloat' via thread stacks today is
> by switching THP to madvise.
> 
> Just wondering if there is anything better or more selective that can be
> done?  Does it make sense to have THP backed stacks by default?  If not,
> who would be best at disabling?  A couple thoughts:
> - The kernel could disable huge pages on stacks.  libpthread/glibc pass
>   the unused flag MAP_STACK.  We could key off this and disable huge pages.
>   However, I'm sure there is somebody somewhere today that is getting better
>   performance because they have huge pages backing their stacks.
> - We could push this to glibc/libpthreads and have them use
>   MADV_NOHUGEPAGE on thread stacks.  However, this also has the potential
>   of regressing performance if somebody somewhere is getting better
>   performance due to huge pages.

Yes, to me it seems that changing a default behavior is never really safe.

For stacks, I really can't tell why they must be treated differently here.
I assume the problem is the wasted space, which is easily multiplied across
N threads.  But IIUC the same applies to THP on normal memory: e.g., each
thread can mmap() 2MB and use only 4K of it, and if each such mmap() is
populated by a THP there'll be the same huge waste.

> - Other thoughts?
> 
> Perhaps this is just expected behavior of THP always which is unfortunate
> in this situation.

I would think it's proper for the app to explicitly choose what it wants,
if possible, and we do have the interfaces for that.

Then, would pthread_attr_getstack() plus MADV_NOHUGEPAGE work, applied at
the JVM framework level?
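
To make the question concrete, here is a rough, untested sketch of what I
have in mind: each new thread looks up its own stack range and marks it
MADV_NOHUGEPAGE before doing real work.  The names stack_info,
stack_nohugepage_self() and worker() are made up for illustration and are
not existing glibc or JVM functions, and pthread_getattr_np() is a GNU
extension that I'm assuming is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping struct, for illustration only. */
struct stack_info {
	void  *addr;        /* lowest address of the calling thread's stack */
	size_t size;        /* stack size in bytes as reported by pthreads */
	int    madvise_err; /* 0 on success, else errno from madvise() */
};

/*
 * Hypothetical helper: ask the kernel not to use THP for the calling
 * thread's stack.  Intended for threads created by pthread_create(),
 * not for the main thread.  Returns 0 or a pthread error number; the
 * madvise() result is recorded separately in info->madvise_err, since
 * it fails with EINVAL on kernels built without THP support.
 */
static int stack_nohugepage_self(struct stack_info *info)
{
	pthread_attr_t attr;
	int ret;

	assert(info != NULL);

	ret = pthread_getattr_np(pthread_self(), &attr);
	if (ret)
		return ret;

	ret = pthread_attr_getstack(&attr, &info->addr, &info->size);
	pthread_attr_destroy(&attr);
	if (ret)
		return ret;

	info->madvise_err =
		madvise(info->addr, info->size, MADV_NOHUGEPAGE) ? errno : 0;
	return 0;
}

/* Example thread entry point: opt out of THP first, then do the work. */
static void *worker(void *arg)
{
	stack_nohugepage_self(arg);
	/* ... real thread work would go here ... */
	return NULL;
}
```

One caveat: MADV_NOHUGEPAGE only affects later faults, so if the first
write fault during libpthread's stack setup already installed a huge page,
this call comes too late for that page.  An alternative, again only an
assumption on my side, would be for the framework to mmap() the stack
itself, madvise() it before first touch, and pass it to the thread with
pthread_attr_setstack().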

Thanks,

-- 
Peter Xu