On 10/14/19 10:37 AM, Michal Hocko wrote:
>> 	for_each_possible_cpu(cpu)
>> 		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
>>
>> It is costly looping through all the cpus to get the lru vec size
>> info.  And doing this on our workload with 96 cpu threads and 500 mem
>> cgroups makes things much worse.  We might end up having 96 cpus *
>> 500 cgroups * 2 (main) LRUs pagevecs, which is a lot of data
>> structures to be running through all the time.
>
> Why does the number of cgroup matter?

I was thinking purely of the cache footprint.  If reading
pn->lruvec_stat_local->count[idx] touches three separate cachelines,
that's 192 bytes of cache * 96 CPUs = ~18k of data, mostly read-only.
One cgroup would be 18k of data for the whole system; the caching would
be pretty efficient and all 18k would probably survive a tight page
fault loop in the L1.  500 cgroups would be ~90k of data per CPU
thread, which doesn't fit in the L1 and probably wouldn't survive a
tight page fault loop if both logical threads were banging on different
cgroups.

It's just a theory, but it's why I noted the number of cgroups when I
initially saw this show up in profiles.
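
For reference, the loop quoted at the top is the per-cpu summation in
the lruvec stat reader.  Roughly what lruvec_page_state_local() looks
like, reconstructed from memory rather than copied from the tree, so
treat it as a sketch:

	static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
							    enum node_stat_item idx)
	{
		struct mem_cgroup_per_node *pn;
		long x = 0;
		int cpu;

		/* No memcg accounting: fall back to the node-wide counter. */
		if (mem_cgroup_disabled())
			return node_page_state(lruvec_pgdat(lruvec), idx);

		/* Sum the per-cpu counters across every possible CPU. */
		pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		for_each_possible_cpu(cpu)
			x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

		/* Per-cpu deltas can transiently sum below zero; clamp. */
		if (x < 0)
			x = 0;
		return x;
	}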
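
And spelling out the footprint math above, assuming 64-byte cachelines
and a ~32k L1D (both assumptions on my part):

	3 cachelines * 64 bytes     = 192 bytes per cgroup, per CPU
	192 bytes    * 96 CPUs      = ~18k for one cgroup, system-wide
	192 bytes    * 500 cgroups  = ~94k touched by one CPU thread
				      (the ~90k above), vs. a ~32k L1D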