Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()

Lance Yang <ioworker0@xxxxxxxxx> · Sat, 27 Jan 2024 16:03:27 +0800

Hey Zach,

Thanks for taking the time to look into this!

On Sat, Jan 27, 2024 at 7:47 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> > I’d like to add another real use case.
> >
> > In our company, we deploy applications using offline-online
> > hybrid deployment. This approach leverages the distinctive
> > resource utilization patterns of online services, utilizing idle
> > resources during various time periods by filling them with
> > offline jobs. This helps reduce the growing cost expenditures
> > for the enterprise.
> >
> > Whether for online services or offline jobs, their requirements
> > for THP can be roughly categorized into three types:
> >
> > * The first type aims to use huge pages as much as possible
> > and tolerates unpredictable stalls caused by direct reclaim
> > and/or compaction.
> > * The second type attempts to use huge pages but is relatively
> > latency-sensitive and cannot tolerate unpredictable stalls.
> > * The third type prefers not to use huge pages at all and is
> > extremely latency-sensitive.
> >
> > After careful consideration, we decided to prioritize the
> > requirements of the first type and modify the THP settings
> > as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> >
> > With the introduction of MADV_COLLAPSE into the kernel,
> > it is no longer dependent on any sysfs setting under
> > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> > offers the potential for fine-grained synchronous control over
> > the huge page allocation mechanism, marking a significant
> > enhancement for THP.
> >
> > If the kernel supports a more relaxed (opportunistic)
> > MADV_COLLAPSE, we will modify the THP settings as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>
> [corrected, via 2 previous mails, to:
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]
>
>
> > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> > to address the requirements of the second type.
> >
> > Why don't we favor madvise(MADV_COLLAPSE) for the first type
> > of requirements?
> > The main reason is that these requirements are typically for offline
> > jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> > which run primarily on the JVM. [..]
>
> Hey Lance,
>
> Thanks for providing this context, it's very helpful.
>
> Though, couldn't you use enabled=always, defrag=defer+madvise, then
> just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the
> behaviour you want? i.e.

prctl(PR_SET_THP_DISABLE) is a good choice that can fully meet
the needs of type-3 workloads.
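
For illustration, a minimal sketch of what a type-3 workload (or a
small launcher wrapped around it) could do. The helper names
thp_disable_self() and thp_is_disabled() are made up for this
example; only the prctl() options themselves are real.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values from the Linux UAPI. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Hypothetical helper: set the "THP disable" flag for the caller.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int thp_disable_self(void)
{
    /* arg2 != 0 sets the flag; arg3..arg5 must be 0. */
    return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}

/*
 * Hypothetical helper: returns 1 if the flag is set, 0 if it is not,
 * -1 with errno set on failure.
 */
static int thp_is_disabled(void)
{
    return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}
```

Since the flag is inherited across fork() and preserved across
execve(), a launcher can call thp_disable_self() and then exec an
unmodified type-3 binary.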

That said, I would still prefer enabled=madvise, since it lets
applications request huge pages selectively through
madvise(MADV_HUGEPAGE). With enabled=always, applications that
are not optimized for huge pages, or that do not benefit from
them, would get them for every eligible allocation, which could
hurt performance.

>
> type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
> type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
> kswapd+kcompactd otherwise

Sorry, I did not express myself clearly. The type-2 requirement
should be:
type 2: apply MADV_HUGEPAGE with defrag=defer, or use a more
relaxed (opportunistic) MADV_COLLAPSE.
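
For reference, a rough sketch of how an external agent can already
request a collapse in another process with
process_madvise(MADV_COLLAPSE) on Linux 6.1+. The helper name
collapse_remote() is hypothetical. flags is passed as 0 because
current kernels reject anything else; a relaxed flag like the one
proposed in this thread would be passed in that argument, and it is
deliberately not defined here.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers (generic syscall numbers, UAPI value). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: ask the kernel to collapse [addr, addr + len)
 * in the address space of process pid. Returns the number of bytes
 * advised on success, -1 with errno set on failure.
 */
static long collapse_remote(pid_t pid, void *addr, size_t len)
{
    struct iovec vec = { .iov_base = addr, .iov_len = len };
    long ret;
    int saved_errno;
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);

    if (pidfd < 0)
        return -1;

    /* flags == 0: no relaxed mode exists in current kernels. */
    ret = syscall(SYS_process_madvise, pidfd, &vec, 1UL,
                  MADV_COLLAPSE, 0U);

    /* Preserve the errno of process_madvise() across close(). */
    saved_errno = errno;
    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

The caller needs sufficient privileges over the target process, and
addr is interpreted in the target's address space, not the caller's.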

> type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs
>
> Or am I missing something? It sounds like a confounding issue is that
> these are external workloads, or you don't have ability to modify? But
> that would preclude MADV_COLLAPSE (unless you're using
> process_madvise()).

Sorry, my previous explanation was unclear. What I meant is
that the requirements of type-1 workloads can be met independently
of any sysfs setting by using madvise(MADV_COLLAPSE).
So why haven't I used it? Because I currently lack the ability
to modify the JVM or PyTorch so that they call
madvise(MADV_COLLAPSE). Therefore, type-1 workloads still rely
on the sysfs settings.
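
To make it concrete, this is roughly the call a runtime would have
to make on its own heap regions. It is only a sketch: the helper
name collapse_range() and the bounded retry on EAGAIN are my own
choices for the example, not something the JVM or PyTorch does
today.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback for older headers; MADV_COLLAPSE exists since Linux 6.1. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: synchronously try to collapse
 * [addr, addr + len) into huge pages, retrying while the kernel
 * reports EAGAIN (resources temporarily unavailable), at most
 * max_tries attempts in total. addr must be page-aligned.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int collapse_range(void *addr, size_t len, int max_tries)
{
    int rc = -1;
    int i;

    if (max_tries <= 0) {
        errno = EINVAL;
        return -1;
    }

    for (i = 0; i < max_tries; i++) {
        rc = madvise(addr, len, MADV_COLLAPSE);
        if (rc == 0 || errno != EAGAIN)
            break;
    }
    return rc;
}
```

On kernels older than 6.1 the call simply fails with EINVAL, so a
caller can treat any failure as "keep running on base pages".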

>
> Appreciate the help understanding the use case. I'm not opposed to the
> idea in general, but IMO would be great to have a clear need for it

I appreciate your perspective!

Thanks again for your valuable insights and your suggestions!
Lance

> (and right now, we don't currently have alignment with the original
> motivating usecase (Go) in that regard w.r.t their plans).
>
> Thanks,
> Zach