Hi Michal,

On Wed, 25 Nov 2020 14:37:40 +0100 Michal Hocko <mhocko@xxxxxxxx> wrote:
> Hi,
> thanks for the detailed report.
>
> On Wed 25-11-20 12:39:56, Bruno Prémont wrote:
> [...]
> > Did memory.low meaning change between 5.7 and 5.9?
>
> The latest semantic change in the low limit protection semantic was
> introduced in 5.7 (recursive protection) but it requires an explicit
> enabling.

No specific mount options are set for the v2 cgroup hierarchy, so that
is not active here.

> > From behavior it
> > feels as if inodes are not accounted to cgroup at all and kernel pushes
> > cgroups down to their memory.low by killing file cache if there is not
> > enough free memory to hold all promises (and not only when a cgroup
> > tries to use up to its promised amount of memory).
>
> Your counters indeed show that the low protection has been breached,
> most likely because the reclaim couldn't make any progress. Considering
> that this is the case for all/most of your cgroups it suggests that the
> memory pressure was global rather than limit imposed. In fact even top
> level cgroups got reclaimed below the low limit.

Note that the "original" counters were partially triggered by a first
event where one cgroup (websrv) had a rather high memory.low (16G or
even 32G), which caused counters everywhere to increase.
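For completeness, whether the recursive-protection opt-in mentioned
above is active can be read off the cgroup2 mount entry. A minimal
sketch (the mounts-file path is parameterized only so the check can be
run against a saved copy):

```shell
#!/bin/sh
# Report whether the cgroup2 mount carries the memory_recursiveprot
# option (the explicit opt-in for recursive low-limit protection).
# Defaults to /proc/mounts; a different file can be passed for testing.
recursiveprot_status() {
    mounts_file="${1:-/proc/mounts}"
    # Take the mount options (field 4) of the first cgroup2 entry.
    opts=$(awk '$3 == "cgroup2" { print $4; exit }' "$mounts_file")
    case ",$opts," in
        *,memory_recursiveprot,*) echo "enabled" ;;
        *)                        echo "not enabled" ;;
    esac
}

recursiveprot_status
```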
So before the last thrashing episode, during which the values were
collected, the event counters and `current` looked as follows:

system/memory.pressure:
  some avg10=0.04 avg60=0.28 avg300=0.12 total=5844917510
  full avg10=0.04 avg60=0.26 avg300=0.11 total=2439353404
system/memory.current: 96432128
system/memory.events.local:
  low 5399469 (unchanged)  high 0  max 112303 (unchanged)  oom 0  oom_kill 0

system/base/memory.pressure:
  some avg10=0.04 avg60=0.28 avg300=0.12 total=4589562039
  full avg10=0.04 avg60=0.28 avg300=0.12 total=1926984197
system/base/memory.current: 59305984
system/base/memory.events.local:
  low 0 (unchanged)  high 0  max 0 (unchanged)  oom 0  oom_kill 0

system/backup/memory.pressure:
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2123293649
  full avg10=0.00 avg60=0.00 avg300=0.00 total=815450446
system/backup/memory.current: 32444416
system/backup/memory.events.local:
  low 5446 (unchanged)  high 0  max 0  oom 0  oom_kill 0

system/shell/memory.pressure:
  some avg10=0.00 avg60=0.00 avg300=0.00 total=1345965660
  full avg10=0.00 avg60=0.00 avg300=0.00 total=492812915
system/shell/memory.current: 4571136
system/shell/memory.events.local:
  low 0  high 0  max 0  oom 0  oom_kill 0

website/memory.pressure:
  some avg10=0.00 avg60=0.00 avg300=0.00 total=415008878
  full avg10=0.00 avg60=0.00 avg300=0.00 total=201868483
website/memory.current: 12104380416
website/memory.events.local:
  low 11264569 (during thrashing: 11372142 then 11377350)
  high 0  max 0  oom 0  oom_kill 0

remote/memory.pressure:
  some avg10=0.00 avg60=0.00 avg300=0.00 total=2005130126
  full avg10=0.00 avg60=0.00 avg300=0.00 total=735366752
remote/memory.current: 116330496
remote/memory.events.local:
  low 11264569 (during thrashing: 11372142 then 11377350)
  high 0  max 0  oom 0  oom_kill 0

websrv/memory.pressure:
  some avg10=0.02 avg60=0.11 avg300=0.03 total=6650355162
  full avg10=0.02 avg60=0.11 avg300=0.03 total=2034584579
websrv/memory.current: 18483359744
websrv/memory.events.local:
  low 0  high 0  max 0  oom 0  oom_kill 0

> This suggests that this is not likely to be
> memcg specific. It is
> more likely that this is a general memory reclaim regression for your
> workload. There were larger changes in that area. Be it lru balancing
> based on cost model by Johannes or working set tracking for anonymous
> pages by Joonsoo. Maybe even more. Both of them can influence page cache
> reclaim but you are suggesting that slab accounted memory is not
> reclaimed properly.

That is my impression, yes. No idea though whether memcg can influence
the way reclaim tries to perform its work, or whether slab_reclaimable
not associated to any (child) cg would somehow be excluded from reclaim.

> I am not sure there were considerable changes
> there. Would it be possible to collect /proc/vmstat as well?

I will have a look at gathering memory.stat and /proc/vmstat at the
next opportunity. I will first try with a test system that has not too
much memory and lots of files, to reproduce about 50% of memory usage
by slab_reclaimable, and see how far I get.

Thanks,
Bruno
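As a starting point for that collection, something like the following
could snapshot the counters so that before/after values around a
thrashing episode can be diffed later. The cgroup root, the cgroup
names, and the output directory are assumptions matching the layout
above; adjust for the actual hierarchy:

```shell
#!/bin/sh
# Snapshot /proc/vmstat and per-cgroup memory.stat into timestamped
# files, so counters before/after a reclaim episode can be compared.
# CGROOT, the cgroup list, and OUT are assumptions for this setup.
CGROOT=/sys/fs/cgroup
OUT="${1:-/var/tmp/reclaim-snapshots}"

snapshot() {
    mkdir -p "$OUT"
    ts=$(date +%Y%m%d-%H%M%S)
    cp /proc/vmstat "$OUT/vmstat.$ts"
    for cg in system websrv website remote; do
        # Skip cgroups that do not exist or are unreadable.
        [ -r "$CGROOT/$cg/memory.stat" ] &&
            cp "$CGROOT/$cg/memory.stat" "$OUT/$cg.memory.stat.$ts"
    done
    return 0
}

snapshot
```

Run in a loop (e.g. `while sleep 30; do ...; done`) or from cron while
pressure builds up, so the interesting transition is captured.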