Re: [RFC PATCH 1/2] psi: introduce memory.pressure.stat

Johannes Weiner <hannes@xxxxxxxxxxx> · Wed, 3 Aug 2022 09:55:39 -0400

On Mon, Aug 01, 2022 at 12:42:04AM +0000, cgel.zte@xxxxxxxxx wrote:
> From: cgel <cgel@xxxxxxxxxx>
> 
> For now, psi memory pressure accounts for all the memory stalls in
> the system, and does not provide detailed information about why the
> stalls happen. This patch introduces a cgroup knob,
> memory.pressure.stat, which reports detailed stall information for
> all memory events, along with its format and the corresponding proc
> interface.
> 
> For the cgroup, add memory.pressure.stat, which shows:
> kswapd: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> direct reclaim: avg10=0.00 avg60=0.00 avg300=0.12 total=42356
> kcompacted: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> direct compact: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> cgroup reclaim: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> workingset thrashing: avg10=0.00 avg60=0.00 avg300=0.00 total=0
> 
> System-wide, a proc file is introduced as pressure/memory_stat,
> and its format is the same as the cgroup interface.
> 
> With this detailed information, for example, if the system is stalled
> because of kcompacted, compaction_proactiveness can be raised so
> that proactive compaction kicks in earlier.
> 
> Signed-off-by: cgel <cgel@xxxxxxxxxx>

> @@ -64,9 +91,11 @@ struct psi_group_cpu {
>  
>  	/* Aggregate pressure state derived from the tasks */
>  	u32 state_mask;
> +	u32 state_memstall;
>  
>  	/* Period time sampling buckets for each state of interest (ns) */
>  	u32 times[NR_PSI_STATES];
> +	u32 times_mem[PSI_MEM_STATES];

This doubles the psi cache footprint on every context switch, wakeup,
sleep, etc. in the scheduler. You're also adding more branches to
those same paths. It'll measurably affect everybody who is using psi.

Yet, in the years of using psi in production myself, I've never felt
the need for what this patch provides. There are event counters for
everything that contributes to pressure, and it's never been hard to
root-cause spikes. There are also things like bpftrace that let you
identify who is stalling for how long in order to do one-off tuning
and systems introspection.
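
To illustrate the event-counter approach, here is a minimal userspace
sketch, not a finished tool. It assumes the counters of interest are
exposed as "name value" lines in /proc/vmstat; the specific counter
names listed are ones found on recent kernels, but which of them are
present depends on kernel version and config. The helper names
(vmstat_lookup, read_file, print_stall_counters) are made up for this
example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in vmstat-style text ("name value\n" per line).
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *value)
{
	const char *line;
	size_t len;

	assert(text && name && value);
	len = strlen(name);

	for (line = text; line && *line; ) {
		/* Require the full name followed by a space, not a prefix. */
		if (!strncmp(line, name, len) && line[len] == ' ') {
			*value = strtoull(line + len + 1, NULL, 10);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Read a small text file into a NUL-terminated heap buffer.
 * Assumption: the file fits in 64 KiB; anything beyond is truncated.
 * The caller frees the result.
 */
char *read_file(const char *path)
{
	enum { CAP = 1 << 16 };
	FILE *f = fopen(path, "r");
	char *buf;
	size_t n;

	if (!f)
		return NULL;
	buf = malloc(CAP);
	if (!buf) {
		fclose(f);
		return NULL;
	}
	n = fread(buf, 1, CAP - 1, f);
	buf[n] = '\0';
	fclose(f);
	return buf;
}

/* Print the counters that most directly relate to memory stalls. */
void print_stall_counters(const char *text)
{
	/* Assumed names; counters missing on this kernel are skipped. */
	static const char *const names[] = {
		"pgscan_kswapd", "pgscan_direct", "compact_stall",
		"workingset_refault_anon", "workingset_refault_file",
	};
	unsigned long long v;
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (vmstat_lookup(text, names[i], &v) == 0)
			printf("%-26s %llu\n", names[i], v);
}
```

Calling print_stall_counters() on the contents of /proc/vmstat before
and after a pressure spike and diffing the two outputs gives a rough
idea of which of reclaim, compaction, or refaults moved during the
spike.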

For this to get merged, it needs a better explanation of the use case
that requires this information to be broadly available all the time.
And it needs to bring down the impact on everybody else who doesn't
want this - either by reducing the footprint or by making the feature
optional.