On Mon 10-07-23 14:53:25, Huang Ying wrote:
> To auto-tune PCP high for each CPU automatically, an
> allocation/freeing depth based PCP high auto-tuning algorithm is
> implemented in this patch.
>
> The basic idea behind the algorithm is to detect the repetitive
> allocation and freeing pattern with a short enough period (about 1
> second). The period needs to be short to respond to allocation and
> freeing pattern changes quickly and to control the memory wasted by
> unnecessary caching.

1s is an eternity from the allocation POV. Is time-based sampling
really a good choice? I would have expected a natural allocation/freeing
feedback mechanism, i.e. double the batch size when the batch is
consumed and needs to be refilled, and shrink it under memory pressure
(a GFP_NOWAIT allocation fails) or when the surplus grows too far above
the batch (e.g. twice as much). Have you considered something as simple
as that? (A rough sketch of what I mean is appended below my signature.)

Quite honestly I am not sure a time-based approach is a good choice,
because memory consumption tends to be quite bulky (e.g. application
starts or workload transitions based on requests).

> To detect the repetitive allocation and freeing pattern, the
> alloc/free depth is calculated for each tuning period (1 second) on
> each CPU. To calculate the alloc/free depth, we track the alloc
> count, which increases for page allocation from PCP and decreases for
> page freeing to PCP. The alloc depth is the maximum alloc count
> difference between a later large value and an earlier small value,
> while the free depth is the maximum alloc count difference between an
> earlier large value and a later small value.
>
> Then, the average alloc/free depth over multiple tuning periods is
> calculated, with the old alloc/free depth decaying in the average
> gradually.
>
> Finally, the PCP high is set to the smaller of the average alloc
> depth and the average free depth, after being clamped between the
> default and the max PCP high. In this way, pure allocation or freeing
> will not enlarge the PCP high, because the PCP doesn't help there.
>
> We have tested the algorithm with several workloads on Intel's
> 2-socket server machines.

How does this scheme deal with memory pressure?
-- 
Michal Hocko
SUSE Labs
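
For the record, the refill/pressure feedback I have in mind looks
roughly like the sketch below. It is an untested user-space model; all
names (pcp_sketch, pcp_batch_grow, ...) and the exact thresholds are
made up for illustration and do not correspond to anything in the patch
or in mm/page_alloc.c.

/*
 * Untested user-space sketch of a refill/pressure feedback scheme.
 * Hypothetical names and thresholds, only meant to illustrate the idea.
 */
#include <stdio.h>

#define PCP_BATCH_MIN	8	/* never shrink below this */
#define PCP_BATCH_MAX	1024	/* never grow above this */

struct pcp_sketch {
	int count;	/* pages currently cached on the PCP list */
	int batch;	/* current refill/drain unit */
};

/*
 * The list ran empty and we had to go back to the buddy allocator:
 * the cache is too small for the current allocation rate, double it.
 */
static void pcp_batch_grow(struct pcp_sketch *pcp)
{
	if (pcp->batch < PCP_BATCH_MAX)
		pcp->batch *= 2;
}

/*
 * Memory pressure (a GFP_NOWAIT allocation failed) or the surplus on
 * the list grew past twice the batch: halve the cache.
 */
static void pcp_batch_shrink(struct pcp_sketch *pcp)
{
	if (pcp->batch > PCP_BATCH_MIN)
		pcp->batch /= 2;
}

/* Called for every page freed to the PCP list. */
static void pcp_free_one(struct pcp_sketch *pcp)
{
	pcp->count++;
	if (pcp->count > 2 * pcp->batch) {
		pcp->count -= pcp->batch;	/* drain a batch back to buddy */
		pcp_batch_shrink(pcp);
	}
}

int main(void)
{
	struct pcp_sketch pcp = { .count = 0, .batch = 32 };
	int i;

	pcp_batch_grow(&pcp);		/* simulate a refill: batch 32 -> 64 */
	for (i = 0; i < 200; i++)	/* long one-way burst of frees */
		pcp_free_one(&pcp);
	printf("batch after free burst: %d\n", pcp.batch);
	return 0;
}

The point is that growing and shrinking are driven directly by refill
and drain events rather than by a timer.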
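
And to make sure I am reading the alloc/free depth bookkeeping in the
changelog correctly, this is how I model it. Again a user-space sketch
with names of my own; in particular the 3/4-old + 1/4-new decay factor
is only a guess, the changelog merely says the old depth decays
gradually.

/*
 * User-space model of the alloc/free depth bookkeeping as I understand
 * the changelog.  Names are mine; the decay weights are a guess.
 */
#include <stdio.h>

struct depth_state {
	long alloc_count;	/* +1 on alloc from PCP, -1 on free to PCP */
	long min_seen;		/* smallest alloc_count in this period */
	long max_seen;		/* largest alloc_count in this period */
	long alloc_depth;	/* largest rise: later max - earlier min */
	long free_depth;	/* largest fall: earlier max - later min */
};

static void record_alloc(struct depth_state *s)
{
	s->alloc_count++;
	if (s->alloc_count > s->max_seen)
		s->max_seen = s->alloc_count;
	/* rise is measured against the smallest value seen earlier */
	if (s->alloc_count - s->min_seen > s->alloc_depth)
		s->alloc_depth = s->alloc_count - s->min_seen;
}

static void record_free(struct depth_state *s)
{
	s->alloc_count--;
	if (s->alloc_count < s->min_seen)
		s->min_seen = s->alloc_count;
	/* fall is measured against the largest value seen earlier */
	if (s->max_seen - s->alloc_count > s->free_depth)
		s->free_depth = s->max_seen - s->alloc_count;
}

/* End of a tuning period: decay the averages, derive the new high. */
static long end_period(struct depth_state *s, long *avg_alloc,
		       long *avg_free, long high_default, long high_max)
{
	long high;

	/* 3/4 old + 1/4 new: old depths decay gradually (guessed weights) */
	*avg_alloc = (*avg_alloc * 3 + s->alloc_depth) / 4;
	*avg_free = (*avg_free * 3 + s->free_depth) / 4;

	/* the smaller of the two averages, clamped to [default, max] */
	high = *avg_alloc < *avg_free ? *avg_alloc : *avg_free;
	if (high < high_default)
		high = high_default;
	if (high > high_max)
		high = high_max;

	/* reset per-period tracking */
	s->min_seen = s->max_seen = s->alloc_count;
	s->alloc_depth = s->free_depth = 0;
	return high;
}

int main(void)
{
	struct depth_state s = { 0 };
	long avg_alloc = 0, avg_free = 0;
	int i;

	for (i = 0; i < 1000; i++)	/* a burst of allocations ... */
		record_alloc(&s);
	for (i = 0; i < 1000; i++)	/* ... freed back in the same period */
		record_free(&s);
	printf("new high = %ld\n",
	       end_period(&s, &avg_alloc, &avg_free, 64, 512));
	return 0;
}

If I read it right, a pure allocation (or pure freeing) burst leaves one
of the two depths at zero, so the high stays clamped at the default,
which matches the "pure allocation or freeing will not enlarge the PCP
high" statement.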