Hey Zach,

Thanks for taking time to look into this!

On Sat, Jan 27, 2024 at 7:47 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> > I’d like to add another real use case.
> >
> > In our company, we deploy applications using offline-online
> > hybrid deployment. This approach leverages the distinctive
> > resource utilization patterns of online services, utilizing idle
> > resources during various time periods by filling them with
> > offline jobs. This helps reduce the growing cost expenditures
> > for the enterprise.
> >
> > Whether for online services or offline jobs, their requirements
> > for THP can be roughly categorized into three types:
> >
> > * The first type aims to use huge pages as much as possible
> >   and tolerates unpredictable stalls caused by direct reclaim
> >   and/or compaction.
> > * The second type attempts to use huge pages but is relatively
> >   latency-sensitive and cannot tolerate unpredictable stalls.
> > * The third type prefers not to use huge pages at all and is
> >   extremely latency-sensitive.
> >
> > After careful consideration, we decided to prioritize the
> > requirements of the first type and modify the THP settings
> > as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> >
> > With the introduction of MADV_COLLAPSE into the kernel,
> > it is no longer dependent on any sysfs setting under
> > /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> > offers the potential for fine-grained synchronous control over
> > the huge page allocation mechanism, marking a significant
> > enhancement for THP.
> >
> > If the kernel supports a more relaxed (opportunistic)
> > MADV_COLLAPSE, we will modify the THP settings as follows:
> >
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>
> [corrected, via 2 previous mails, to:
> echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]
>
> > Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> > to address the requirements of the second type.
> >
> > Why don't we favor madvise(MADV_COLLAPSE) for the first type
> > of requirements?
> > The main reason is that these requirements are typically for offline
> > jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> > which run primarily on the JVM. [..]
>
> Hey Lance,
>
> Thanks for providing this context, it's very helpful.
>
> Though, couldn't you use enabled=always, defrag=defer+madvise, then
> just use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the
> behaviour you want? i.e.

prctl(PR_SET_THP_DISABLE) is a good choice that can fully meet the
needs of type-3 workloads.

I would still prefer enabled=madvise, since it lets applications
request huge pages selectively through explicit madvise() calls.
With enabled=always, applications that are not optimized for huge
pages, or do not benefit from them, would get them for all
allocations, which could hurt performance.

>
> type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
> type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
> kswapd+kcompactd otherwise

Sorry, I did not express myself clearly. The type-2 requirements
should be:

type 2: apply MADV_HUGEPAGE with defrag=defer, or use a more relaxed
(opportunistic) MADV_COLLAPSE.

> type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs
>
> Or am I missing something?
> It sounds like a confounding issue is that
> these are external workloads, or you don't have ability to modify? But
> that would preclude MADV_COLLAPSE (unless you're using
> process_madvise()).

Sorry, my previous explanation was unclear.

What I meant is that the requirements of type-1 workloads can be
independent of any sysfs setting and can be addressed using
madvise(MADV_COLLAPSE).

So why haven't I used it? Because I currently lack the capability to
modify the JVM or PyTorch to make them compatible with
madvise(MADV_COLLAPSE). Therefore, the needs of type-1 workloads still
rely on sysfs settings.

>
> Appreciate the help understanding the use case. I'm not opposed to the
> idea in general, but IMO would be great to have a clear need for it

I appreciate your perspective!

Thanks again for your valuable insights and your suggestions!
Lance

> (and right now, we don't currently have alignment with the original
> motivating usecase (Go) in that regard w.r.t their plans).
>
> Thanks,
> Zach
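
For reference, the sysfs settings discussed in this thread (enabled=madvise, defrag=defer+madvise) can be checked at runtime, because the kernel marks the active option in each file with square brackets, e.g. "always [madvise] never". The following is only a minimal sketch, not something posted in the thread: the helper names active_thp_value and read_thp_setting are hypothetical, and the second function assumes a Linux system that exposes /sys/kernel/mm/transparent_hugepage.

```python
import re

# Assumption: standard THP sysfs location, as used by the echo commands above.
THP_SYSFS_ROOT = "/sys/kernel/mm/transparent_hugepage"


def active_thp_value(sysfs_text: str) -> str:
    """Return the bracketed (currently active) option from a THP sysfs file.

    Example input: "always defer [defer+madvise] madvise never"
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active option found in {sysfs_text!r}")
    return match.group(1)


def read_thp_setting(name: str, root: str = THP_SYSFS_ROOT) -> str:
    """Read the sysfs file 'enabled' or 'defrag' and return its active option.

    Hypothetical helper; it raises OSError if THP sysfs is not available.
    """
    with open(f"{root}/{name}") as f:
        return active_thp_value(f.read())
```

A deployment script could call read_thp_setting("enabled") and read_thp_setting("defrag") before starting a type-1 job, and refuse to start if the values are not the expected "madvise" and "defer+madvise".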