On 03/28/2014 12:42 AM, Greg Thelen wrote: > On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov > <vdavydov@xxxxxxxxxxxxx> wrote: >> On 03/27/2014 08:31 AM, Greg Thelen wrote: >>> Before this change both of the following allocations are charged to >>> memcg (assuming kmem accounting is enabled): >>> a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL) >>> b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL) >>> >>> After this change only 'a' is charged; 'b' goes directly to page >>> allocator which no longer does accounting. >> >> Why do we need to charge 'b' in the first place? Can the userspace >> trigger such allocations massively? If there can only be one or two such >> allocations from a cgroup, is there any point in charging them? > > Of the top of my head I don't know of any >8KIB kmalloc()s so I can't > say if they're directly triggerable by user space en masse. But we > recently ran into some order:3 allocations in networking. The > networking allocations used a non-generic kmem_cache (rather than > kmalloc which started this discussion). For details, see ed98df3361f0 > ("net: use __GFP_NORETRY for high order allocations"). I can't say if > such allocations exist in device drivers, but given the networking > example, it's conceivable that they may (or will) exist. Hmm, also not sure about device drivers, but the sock frag pages you mentioned are worth charging I guess. For such non-generic kmem allocations, we have two options: either go with __GFP_KMEMCG, or introduce special alloc/free_kmem_pages methods, which would be used instead of kmalloc for large allocations (e.g. threadinfo, sock frags). I vote for the second, because I dislike having kmemcg charging in the general allocation path. Anyway, that brings us back to the necessity to reliably track arbitrary pages in kmemcg to allow reparenting. > With slab this isn't a problem because sla has kmalloc kmem_caches for > all supported allocation sizes. However, slub shows this issue for > any kmalloc() allocations larger than 8KIB (at least on x86_64). It > seems like a strange directly to take kmem accounting to say that > kmalloc allocations are kmem limited, but only if they are either less > than a threshold size or done with slab. Simply increasing the size > of a data structure doesn't seem like it should automatically cause > the memory to become exempt from kmem limits. Sounds fair. >> In fact, do we actually need to charge every random kmem allocation? I >> guess not. For instance, filesystems often allocate data shared among >> all the FS users. It's wrong to charge such allocations to a particular >> memcg, IMO. That said the next step is going to be adding a per kmem >> cache flag specifying if allocations from this cache should be charged >> so that accounting will work only for those caches that are marked so >> explicitly. > > It's a question of what direction to approach kmem slab accounting > from: either opt-out (as the code currently is), or opt-in (with per > kmem_cache flags as you suggest). I agree that some structures end up > being shared (e.g. filesystem block bit map structures). In an > opt-out system these are charged to a memcg initially and remain > charged there until the memcg is deleted at which point the shared > objects are reparented to a shared location. While this isn't > perfect, it's unclear if it's better or worse than analyzing each > class of allocation and deciding if they should be opt'd-in. One > could (though I'm not) make the case that even dentries are easily > shareable between containers and thus shouldn't be accounted to a > single memcg. But given user space's ability to DoS a machine with > dentires, they should be accounted. Again, you must be right. After a bit of thinking I agree that deciding which caches should be accounted and which shouldn't would be cumbersome. Opt-out would be clearer, I guess. >> There is one more argument for removing kmalloc_large accounting - we >> don't have an easy way to track such allocations, which prevents us from >> reparenting kmemcg charges on css offline. Of course, we could link >> kmalloc_large pages in some sort of per-memcg list which would allow us >> to find them on css offline, but I don't think such a complication is >> justified. > > I assume that reparenting of such non kmem_cache allocations (e.g. > large kmalloc) is difficult because such pages refer to the memcg, > which we're trying to delete and the memcg has no index of such pages. > If such zombie memcg are undesirable, then an alternative to indexing > the pages is to define a kmem context object which such large pages > point to. The kmem context would be reparented without needing to > adjust the individual large pages. But there are plenty of options. I like the idea about the context object. For usual kmalloc'ed data, we already have one - the kmem_cache itself. For non-generic kmem (e.g. threadinfo pages), we could easily introduce a separate one with the pointer to the owning memcg on it. Reparenting wouldn't be a problem at all then. I guess I'll give it a try in the next iteration. Thank you! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>