On Wed, Oct 26, 2022 at 07:41:21PM -0700, Yosry Ahmed wrote: > On Wed, Oct 26, 2022 at 1:51 PM Yang Shi <shy828301@xxxxxxxxx> wrote: > > > > On Wed, Oct 26, 2022 at 10:32 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > > > > > On Tue, Oct 25, 2022 at 02:53:01PM -0700, Yang Shi wrote: > > > > On Tue, Oct 25, 2022 at 1:54 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > > > > > > > > > On Tue, Oct 25, 2022 at 12:40:15PM -0700, Yang Shi wrote: > > > > > > On Tue, Oct 25, 2022 at 10:05 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote: > > > > > > > > > > > > > > Direct reclaim stats are useful for identifying a potential source for > > > > > > > application latency, as well as spotting issues with kswapd. However, > > > > > > > khugepaged currently distorts the picture: as a kernel thread it > > > > > > > doesn't impose allocation latencies on userspace, and it explicitly > > > > > > > opts out of kswapd reclaim. Its activity showing up in the direct > > > > > > > reclaim stats is misleading. Counting it as kswapd reclaim could also > > > > > > > cause confusion when trying to understand actual kswapd behavior. > > > > > > > > > > > > > > Break out khugepaged from the direct reclaim counters into new > > > > > > > pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters. > > > > > > > > > > > > > > Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS): > > > > > > > > > > > > > > pgsteal_kswapd 1342185 > > > > > > > pgsteal_direct 0 > > > > > > > pgsteal_khugepaged 3623 > > > > > > > pgscan_kswapd 1345025 > > > > > > > pgscan_direct 0 > > > > > > > pgscan_khugepaged 3623 > > > > > > > > > > > > There are other kernel threads or works may allocate memory then > > > > > > trigger memory reclaim, there may be similar problems for them and > > > > > > someone may try to add a new stat. So how's about we make the stats > > > > > > more general, for example, call it "pg{steal|scan}_kthread"? > > > > > > > > > > I'm not convinved that's a good idea. > > > > > > > > > > Can you generally say that userspace isn't indirectly waiting for one > > > > > of those allocating threads? With khugepaged, we know. > > > > > > > > AFAIK, ksm may do slab allocation with __GFP_DIRECT_RECLAIM. > > > > > > Right, but ksm also uses __GFP_KSWAPD_RECLAIM. So while userspace > > > isn't directly waiting for ksm, when ksm enters direct reclaim it's > > > because kswapd failed. This is of interest to kernel developers. > > > Userspace will likely see direct reclaim in that scenario as well, so > > > the ksm direct reclaim counts aren't liable to confuse users. > > > > > > Khugepaged on the other hand will *always* reclaim directly, even if > > > there is no memory pressure or kswapd failure. The direct reclaim > > > counts there are misleading to both developers and users. > > > > > > What it really should be is pgscan_nokswapd_nouserprocesswaiting, but > > > that just seems kind of long ;-) > > > > > > I'm also not sure anybody but khugepaged is doing direct reclaim > > > without kswapd reclaim. It seems unlikely we'll get more of those. > > > > IIUC you actually don't care about how many direct reclaim are > > triggered by khugepaged, but you would like to separate the direct > > reclaim stats between that are triggered directly by userspace > > actions, which may stall userspace, and that aren't, which don't stall > > userspace. If so it doesn't sound that important to distinguish > > whether the direct reclaim are triggered by khugepaged or other kernel > > threads even though other kthreads are not liable to confuse users > > IMHO. I feel like I've sufficiently explained my reason for wanting to separate out the __GFP_KSWAPD_RECLAIM special case from other sites. > My 2c, if we care about direct reclaim as in reclaim that may stall > user space application allocations, then there are other reclaim > contexts that may pollute the direct reclaim stats. For instance, > proactive reclaim, or reclaim done by writing a limit lower than the > current usage to memory.max or memory.high, as they are not done in > the context of the application allocating memory. > > At Google, we have some internal direct reclaim memcg statistics, and > the way we handle this is by passing a flag from such contexts to > try_to_free_mem_cgroup_pages() in the reclaim_options arg. This flag > is echod into a scan_struct bit, which we then use to filter out > direct reclaim operations that actually cause latencies in user space > allocations. > > Perhaps something similar might be more generic here? I am not sure > what context khugepaged reclaims memory from, but I think it's not a > memcg context, so maybe we want to generalize the reclaim_options arg > to try_to_free_pages() or whatever interface khugepaged uses to free > memory. So at the /proc/vmstat level, I'm not sure it matters much because it doesn't count any cgroup_reclaim() activity. But at the cgroup level, it sure would be nice to split out proactive reclaim churn. Both in terms of not polluting direct reclaim counts, but also for *knowing* how much proactive reclaim is doing. Do you have separate counters for this?