On Mon 14-10-19 10:49:49, Dave Hansen wrote:
> On 10/14/19 10:37 AM, Michal Hocko wrote:
> >> for_each_possible_cpu(cpu)
> >>         x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> >>
> >> It is costly looping through all the cpus to get the lru vec size info.
> >> And doing this on our workload with 96 cpu threads and 500 mem cgroups
> >> makes things much worse. We might end up having 96 cpus * 500 cgroups *
> >> 2 (main) LRUs pagevecs, which is a lot of data structures to be running
> >> through all the time.
> > Why does the number of cgroup matter?
>
> I was thinking purely of the cache footprint.  If it's reading
> pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
> bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
> be 18k of data for the whole system and the caching would be pretty
> efficient and all 18k would probably survive a tight page fault loop in
> the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
> fit in the L1 and probably wouldn't survive a tight page fault loop if
> both logical threads were banging on different cgroups.
>
> It's just a theory, but it's why I noted the number of cgroups when I
> initially saw this show up in profiles.

Yes, the cache traffic might be really high, but I still find it a bit
surprising that it makes such a large footprint, because this should
mostly be called from slow paths (reclaim) and the real work done there
should simply be larger - at least that's my intuition, which might be
quite off here.

Btw, how much is that 25% of system time relative to the total time?
--
Michal Hocko
SUSE Labs
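
For reference, a minimal user-space sketch of the footprint arithmetic
quoted above. The 96 CPUs, 500 cgroups, and the three-cachelines-per-read
estimate come from the thread; the 64-byte cacheline size is an assumed
(typical x86) value, and the program itself is illustrative, not kernel
code:

#include <stdio.h>

/*
 * Back-of-the-envelope check of the cache footprint argument.
 * Assumptions (not measured): 64-byte cachelines, and each read of
 * pn->lruvec_stat_local->count[idx] touching 3 distinct cachelines,
 * as estimated in the quoted mail. 96 CPUs and 500 cgroups are the
 * reported workload parameters.
 */
int main(void)
{
	const unsigned long cacheline = 64;      /* bytes, typical x86 */
	const unsigned long lines_per_read = 3;  /* estimate from the thread */
	const unsigned long cpus = 96;
	const unsigned long cgroups = 500;

	unsigned long per_read = lines_per_read * cacheline;   /* 192 B */
	unsigned long one_cgroup_all_cpus = per_read * cpus;   /* ~18 KiB */
	unsigned long all_cgroups_one_cpu = per_read * cgroups;/* ~94 KiB */

	printf("bytes touched per per-cpu read:     %lu\n", per_read);
	printf("one cgroup, summed over %lu CPUs:   %lu (~%lu KiB)\n",
	       cpus, one_cgroup_all_cpus, one_cgroup_all_cpus / 1024);
	printf("%lu cgroups, as seen from one CPU:  %lu (~%lu KiB)\n",
	       cgroups, all_cgroups_one_cpu, all_cgroups_one_cpu / 1024);
	return 0;
}

This reproduces the numbers in the quoted mail: 192 bytes * 96 CPUs is
roughly 18k for a single cgroup, while 192 bytes * 500 cgroups is roughly
90k of state touched from one CPU, which is what overflows the L1 in the
500-cgroup case.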