Re: [PATCH v2] perf script python: integrate page reclaim analyze script

Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> · Tue, 1 Oct 2019 15:45:24 +0100

On Mon, Sep 30, 2019 at 11:19:44PM -0400, Yafang Shao wrote:
> A new perf script page-reclaim is introduced in this patch. This new script
> is used to report the page reclaim details. The possible usage of this
> script is as below,
> - identify latency spike caused by direct reclaim
> - whether the latency spike is relevant with pageout
> - why is page reclaim requested, i.e. whether it is because of memory
>   fragmentation
> - page reclaim efficiency
> etc
> In the future we may also enhance it to analyze the memcg reclaim.
> 

Hi,

I ended up not reviewing this patch in detail simply because I would
approach the same class of problem in an entirely different way today.
There is value in accumulating the stats in a report like this:

>     $ perf script report page-reclaim
>     Direct reclaims: 4924
>     Direct latency (ms)        total         max         avg         min
>         	          177823.211    6378.977      36.114       0.051
>     Direct file reclaimed 22920
>     Direct file scanned 28306
>     Direct file sync write I/O 0
>     Direct file async write I/O 0
>     Direct anon reclaimed 212567
>     Direct anon scanned 1446854
>     Direct anon sync write I/O 0
>     Direct anon async write I/O 278325
>     Direct order      0     1     3
>         	   4870    23    31
>     Wake kswapd requests 716
>     Wake order      0     1
>         	  715     1
> 
>     Kswapd reclaims: 9

However, the basic option I would prefer is the raw latency data behind
the Direct latency figures, in a form that can be parsed externally by R
or any other statistical tool. Knowing the max latency is not enough; I'd
want to know the spread of latencies and whether they were clustered at a
point in time or spread out over long periods of time. I would then build
the higher-level reports on top if necessary.
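
As a rough illustration of that kind of external post-processing, here is
a minimal Python sketch. It assumes the raw data is already available as
(timestamp in seconds, latency in ms) pairs; the function name, the
10-second bucket and the 100ms "slow" threshold are all hypothetical
choices for illustration, not something the patch provides.

```python
import statistics
from collections import Counter

def summarize_latencies(samples, bucket_s=10.0, slow_ms=100.0):
    """Summarize raw direct reclaim latencies.

    samples: iterable of (timestamp_s, latency_ms) pairs (assumed format).
    Returns (spread, slow_per_bucket).
    """
    samples = list(samples)
    lat = sorted(latency for _, latency in samples)

    # Percentiles show the spread that a lone max/avg/min line hides.
    # statistics.quantiles() needs at least two data points.
    cuts = statistics.quantiles(lat, n=100, method="inclusive")
    spread = {
        "min": lat[0],
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "max": lat[-1],
    }

    # Count slow reclaims per time bucket: one hot bucket means the stalls
    # were clustered, many lukewarm buckets means they were spread out.
    slow_per_bucket = Counter(
        int(ts // bucket_s) for ts, latency in samples if latency >= slow_ms
    )
    return spread, dict(slow_per_bucket)
```

With per-bucket counts in hand it is easy to tell a single burst of stalls
apart from a steady trickle, which the max and average alone cannot show.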

Today, I would also consider getting the latency figures using eBPF or
systemtap instead, although having perf do it may be useful too. That
approach is not universally popular though, so at minimum I would have:

perf script record page-reclaim -- capture all page-reclaim tracepoints
perf script report page-reclaim -- For reclaim entry/exit, merge the two
	tracepoints into one that reports latency. Dump the rest out
	verbatim
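
The merge step could look roughly like the following Python sketch. It
assumes the events have already been parsed into (timestamp_s, pid, name,
payload) tuples in time order; that tuple layout, the function name and
the "direct_reclaim_latency_ms" record name are hypothetical, and this is
not the perf script handler interface. The two tracepoint names are the
existing vmscan direct reclaim begin/end tracepoints.

```python
BEGIN = "mm_vmscan_direct_reclaim_begin"
END = "mm_vmscan_direct_reclaim_end"

def merge_reclaim_events(events):
    """Merge direct reclaim begin/end pairs into single latency records.

    events: iterable of (timestamp_s, pid, name, payload) tuples in time
    order (assumed format). All other events pass through unchanged.
    """
    in_flight = {}  # pid -> timestamp of the reclaim currently in progress
    merged = []
    for ts, pid, name, payload in events:
        if name == BEGIN:
            in_flight[pid] = ts
        elif name == END:
            begin = in_flight.pop(pid, None)
            if begin is None:
                # Tracing started mid-reclaim, so there is no begin event
                # to pair with; drop the orphaned end event.
                continue
            latency_ms = (ts - begin) * 1000.0
            merged.append((begin, pid, "direct_reclaim_latency_ms", latency_ms))
        else:
            # Everything else is dumped verbatim.
            merged.append((ts, pid, name, payload))
    return merged
```

Keying on pid works because a task performs at most one direct reclaim at
a time, so a begin event is always followed by its own end event on the
same task.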

For latencies, I would externally post-process them until such time as I
found a common class of bug that needed a high-level report and then
build the perf script support for it.

Please note that I did not spot anything wrong with your script; it's
just that I would not use it myself in its current format for debugging
a reclaim-related problem.

-- 
Mel Gorman
SUSE Labs