On Thu, Oct 13, 2016 at 08:53:28AM +0200, Michal Hocko wrote: > On Wed 12-10-16 09:09:49, Shaohua Li wrote: > > Our system uses significantly more slab memory with memcg enabled with > > latest kernel. With 3.10 kernel, slab uses 2G memory, while with 4.6 > > kernel, 6G memory is used. Looks the shrinker has problem. Let's see we > > have two memcg for one shrinker. In do_shrink_slab: > > > > 1. Check cg1. nr_deferred = 0, assume total_scan = 700. batch size is 1024, > > then no memory is freed. nr_deferred = 700 > > 2. Check cg2. nr_deferred = 700. Assume freeable = 20, then total_scan = 10 > > or 40. Let's assume it's 10. No memory is freed. nr_deferred = 10. > > > > The deferred share of cg1 is lost in this case. kswapd will free no > > memory even run above steps again and again. I agree this is possible. IMO the ideal way to fix this problem would be making deferred counters per memory cgroup. That would also resolve possible fairness issues when objects deferred by one cgroup are reclaimed from another. However, it's unclear to me how to implement it w/o bringing in a lot of awkward code. So I guess your patch is reasonable for now. Apart from a couple nitpicks (below), it looks good to me: Acked-by: Vladimir Davydov <vdavydov.dev@xxxxxxxxx> > > @@ -312,7 +313,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", > > shrinker->scan_objects, total_scan); > > total_scan = freeable; > > - } > > + next_deferred = nr; > > + } else > > + next_deferred = total_scan; nitpick: Why do we want to handle this what-the-heck-is-going-on case in a special way? Why not just always assign total_scan to next_deferred here? I don't see how it could make things worse when total_scan gets screwed up. > > > > /* > > * We need to avoid excessive windup on filesystem shrinkers > > @@ -369,17 +372,22 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > > count_vm_events(SLABS_SCANNED, nr_to_scan); > > total_scan -= nr_to_scan; > > + scanned += nr_to_scan; nitpick: We could get along w/o 'scanned' here: next_deferred -= nr_to_scan; > > > > cond_resched(); > > } > > > > + if (next_deferred >= scanned) > > + next_deferred -= scanned; > > + else > > + next_deferred = 0; ... and this chunk wouldn't be needed then. > > /* > > * move the unused scan count back into the shrinker in a > > * manner that handles concurrent updates. If we exhausted the > > * scan, there is no need to do an update. > > */ > > - if (total_scan > 0) > > - new_nr = atomic_long_add_return(total_scan, > > + if (next_deferred > 0) > > + new_nr = atomic_long_add_return(next_deferred, > > &shrinker->nr_deferred[nid]); > > else > > new_nr = atomic_long_read(&shrinker->nr_deferred[nid]); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>