> I’d like to add another real use case.
>
> In our company, we deploy applications using offline-online
> hybrid deployment. This approach leverages the distinctive
> resource utilization patterns of online services, utilizing idle
> resources during various time periods by filling them with
> offline jobs. This helps reduce the growing cost expenditures
> for the enterprise.
>
> Whether for online services or offline jobs, their requirements
> for THP can be roughly categorized into three types:
>
> * The first type aims to use huge pages as much as possible
>   and tolerates unpredictable stalls caused by direct reclaim
>   and/or compaction.
> * The second type attempts to use huge pages but is relatively
>   latency-sensitive and cannot tolerate unpredictable stalls.
> * The third type prefers not to use huge pages at all and is
>   extremely latency-sensitive.
>
> After careful consideration, we decided to prioritize the
> requirements of the first type and modify the THP settings
> as follows:
>
>     echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
>     echo defer >/sys/kernel/mm/transparent_hugepage/defrag
>
> With the introduction of MADV_COLLAPSE into the kernel,
> it is no longer dependent on any sysfs setting under
> /sys/kernel/mm/transparent_hugepage. MADV_COLLAPSE
> offers the potential for fine-grained synchronous control over
> the huge page allocation mechanism, marking a significant
> enhancement for THP.
>
> If the kernel supports a more relaxed (opportunistic)
> MADV_COLLAPSE, we will modify the THP settings as follows:
>
>     echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
>     echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

[corrected, via 2 previous mails, to:

    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag]

> Then, we will use process_madvise(MADV_COLLAPSE, xx_relaxed_flag)
> to address the requirements of the second type.
>
> Why don't we favor madvise(MADV_COLLAPSE) for the first type
> of requirements?
> The main reason is that these requirements are typically for offline
> jobs in the Hadoop ecosystem, such as MapReduce and Spark,
> which run primarily on the JVM.
[..]

Hey Lance,

Thanks for providing this context, it's very helpful.

Though, couldn't you use enabled=always, defrag=defer+madvise, then just
use prctl(PR_SET_THP_DISABLE) on type-3 workloads to get the behaviour
you want? i.e.

type 1: apply MADV_HUGEPAGE -> sync defrag to get THP
type 2: don't apply MADV_HUGEPAGE -> use THP if available, kick
        kswapd+kcompactd otherwise
type 3: use prctl(PR_SET_THP_DISABLE) (or MADV_NOHUGEPAGE) -> no THPs

Or am I missing something?

It sounds like a confounding issue is that these are external workloads,
or ones you don't have the ability to modify? But that would preclude
MADV_COLLAPSE (unless you're using process_madvise()).

Appreciate the help understanding the use case. I'm not opposed to the
idea in general, but IMO it would be great to have a clear need for it
(and right now, we don't have alignment with the original motivating
use case (Go) in that regard w.r.t. their plans).

Thanks,
Zach
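
For reference, a minimal sketch of how the three types above could map
onto existing interfaces, assuming the system-wide settings are
enabled=always and defrag=defer+madvise. The enum and the two helper
names (thp_apply_region, thp_disable_process) are hypothetical and only
for illustration; madvise(2) with MADV_HUGEPAGE / MADV_NOHUGEPAGE and
prctl(2) with PR_SET_THP_DISABLE are the real interfaces. Untested
against the workloads discussed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41   /* value from <linux/prctl.h> */
#endif

/* Hypothetical names for the three workload types in this thread. */
enum thp_policy {
	THP_TYPE1_SYNC,          /* wants THP, tolerates stalls */
	THP_TYPE2_OPPORTUNISTIC, /* wants THP, cannot tolerate stalls */
	THP_TYPE3_NONE,          /* no THP at all */
};

/*
 * Apply a per-region policy, assuming enabled=always and
 * defrag=defer+madvise are already set in sysfs.
 * Returns 0 on success, -1 with errno set on failure.
 */
int thp_apply_region(void *addr, size_t len, enum thp_policy policy)
{
	assert(addr != NULL && len > 0);

	switch (policy) {
	case THP_TYPE1_SYNC:
		/* Madvised region: faults may direct-reclaim/compact. */
		return madvise(addr, len, MADV_HUGEPAGE);
	case THP_TYPE2_OPPORTUNISTIC:
		/* No advice: THP if available, else defer to
		 * kswapd/kcompactd and fall back to base pages. */
		return 0;
	case THP_TYPE3_NONE:
		/* Opt this region out of THP. */
		return madvise(addr, len, MADV_NOHUGEPAGE);
	default:
		errno = EINVAL;
		return -1;
	}
}

/*
 * Process-wide opt-out for type 3. The flag is inherited across
 * fork() and preserved across execve(), so a launcher can set it
 * before exec'ing a workload it cannot modify.
 */
int thp_disable_process(void)
{
	return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}
```

Since the prctl flag survives execve(), a small wrapper that calls
thp_disable_process() and then execs the job would cover type-3
workloads without touching their code; type 1 still needs something to
issue MADV_HUGEPAGE on the relevant regions.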