On Mon, Feb 03, 2020 at 05:39:54PM -0500, Johannes Weiner wrote: > On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote: > > On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote: > > > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote: > > > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote: > > > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote: > > > > > > Currently s8 type is used for per-cpu caching of per-node statistics. > > > > > > It works fine because the overfill threshold can't exceed 125. > > > > > > > > > > > > But if some counters are in bytes (and the next commit in the series > > > > > > will convert slab counters to bytes), it's not gonna work: > > > > > > value in bytes can easily exceed s8 without exceeding the threshold > > > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking > > > > > > vmstats correctness, let's use s32 instead. > > > > > > > > > > > > This doesn't affect per-zone statistics. There are no plans to use > > > > > > zone-level byte-sized counters, so no reasons to change anything. > > > > > > > > > > Wait, is this still necessary? AFAIU, the node counters will account > > > > > full slab pages, including free space, and only the memcg counters > > > > > that track actual objects will be in bytes. > > > > > > > > > > Can you please elaborate? > > > > > > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B) > > > > being in different units depending on the accounting scope. > > > > So I do convert all slab counters: global, per-lruvec, > > > > and per-memcg to bytes. > > > > > > Since the node counters tracks allocated slab pages and the memcg > > > counter tracks allocated objects, arguably they shouldn't use the same > > > name anyway. > > > > > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec > > > > NR_SLAB_RECLAIMABLE_OBJ > > > > NR_SLAB_UNRECLAIMABLE_OBJ > > > > > > Can we alias them and reuse their slots? > > > > > > /* Reuse the node slab page counters item for charged objects */ > > > MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE, > > > MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE, > > > > Yeah, lgtm. > > > > Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to > > the enum memcg_stat_item? > > Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps? Maybe not. I'll probably go with MEMCG_SLAB_RECLAIMABLE_B = NR_SLAB_RECLAIMABLE, MEMCG_SLAB_UNRECLAIMABLE_B = NR_SLAB_UNRECLAIMABLE, Please, let me know if you're not ok with it. > > > > > and keep global counters untouched. If going this way, I'd prefer to make > > > > them per-memcg, because it will simplify things on charging paths: > > > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(), > > > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to > > > > bump per-lruvec counters. > > > > > > I don't quite follow. Don't you still have to update the global > > > counters? > > > > Global counters are updated only if an allocation requires a new slab > > page, which isn't the most common path. > > Right. > > > In generic case post_hook is required because it's the only place where > > we have both page (to get the node) and memcg pointer. > > > > If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK), > > then post_hook can handle only the rare "allocation failed" case. > > > > I'm not sure here what's better. > > If it's tracked only per-memcg, you still have to account it every > time you charge an object to a memcg, no? How is it less frequent than > acconting at the lruvec level? It's not less frequent, it just can be done in the pre-alloc hook when there is a memcg pointer available. The problem with the obj_cgroup thing is that we get it indirectly from current memcg in the pre_alloc_hook, then pass it to obj_cgroup API, internally we might need to get the memcg from it to charge a page, and then again in the post_hook we need to get memcg to bump per-lruvec stats. In other words we make several memcg <-> objcg conversions, which isn't very nice on the hot path. I see that in the future we might optimize the initial lookup of objcg, but getting memcg just to bump vmstats looks unnecessarily expensive. One option I think about is to handle byte-sized stats on obj_cgroup level and flush whole pages to memcg level. > > > > > Btw, I wonder if we really need per-lruvec counters at all (at least > > > > being enabled by default). For the significant amount of users who > > > > have a single-node machine it doesn't bring anything except performance > > > > overhead. > > > > > > Yeah, for single-node systems we should be able to redirect everything > > > to the memcg counters, without allocating and tracking lruvec copies. > > > > Sounds good. It can lead to significant savings on single-node machines. > > > > > > > > > For those who have multiple nodes (and most likely many many > > > > memory cgroups) it provides way too many data except for debugging > > > > some weird mm issues. > > > > I guess in the absolute majority of cases having global per-node + per-memcg > > > > counters will be enough. > > > > > > Hm? Reclaim uses the lruvec counters. > > > > Can you, please, provide some examples? It looks like it's mostly based > > on per-zone lruvec size counters. > > It uses the recursive lruvec state to decide inactive_is_low(), > whether refaults are occuring, whether to trim cache only or go for > anon etc. We use it to determine refault distances and how many shadow > nodes to shrink. > > Grep for lruvec_page_state(). I see... Thanks! > > > Anyway, it seems to be a little bit off from this patchset, so let's > > discuss it separately. > > True