On Wed, Jun 05, 2019 at 12:39:24AM -0700, Greg Thelen wrote:
> Roman Gushchin <guro@xxxxxx> wrote:
>
> > # Why do we need this?
> >
> > We've noticed that the number of dying cgroups is steadily growing on most
> > of our hosts in production. The following investigation revealed an issue
> > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > and also the main reason: slab objects.
> >
> > The underlying problem is quite simple: any page charged
> > to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> > all charged pages are gone. If a slab object is actively used by other cgroups,
> > it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
> >
> > Slab objects, and first of all vfs cache, is shared between cgroups, which are
> > using the same underlying fs, and what's even more important, it's shared
> > between multiple generations of the same workload. So if something is running
> > periodically every time in a new cgroup (like how systemd works), we do
> > accumulate multiple dying cgroups.
> >
> > Strictly speaking pagecache isn't different here, but there is a key difference:
> > we disable protection and apply some extra pressure on LRUs of dying cgroups,
> > and these LRUs contain all charged pages.
> > My experiments show that with the disabled kernel memory accounting the number
> > of dying cgroups stabilizes at a relatively small number (~100, depends on
> > memory pressure and cgroup creation rate), and with kernel memory accounting
> > it grows pretty steadily up to several thousands.
> >
> > Memory cgroups are quite complex and big objects (mostly due to percpu stats),
> > so it leads to noticeable memory losses. Memory occupied by dying cgroups
> > is measured in hundreds of megabytes. I've even seen a host with more than 100Gb
> > of memory wasted for dying cgroups. It leads to a degradation of performance
> > with the uptime, and generally limits the usage of cgroups.
> >
> > My previous attempt [3] to fix the problem by applying extra pressure on slab
> > shrinker lists caused regressions with xfs and ext4, and has been reverted [4].
> > The following attempts to find the right balance [5, 6] were not successful.
> >
> > So instead of trying to find a maybe non-existing balance, let's reparent
> > the accounted slabs to the parent cgroup on cgroup removal.
> >
> >
> > # Implementation approach
> >
> > There is however a significant problem with reparenting of slab memory:
> > there is no list of charged pages. Some of them are in shrinker lists,
> > but not all. Introducing a new list is really not an option.
> >
> > But fortunately there is a way forward: every slab page has a stable pointer
> > to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> > instead of slab pages.
> >
> > It's actually simpler and cheaper, but requires some underlying changes:
> > 1) Make kmem_caches hold a single reference to the memory cgroup,
> >    instead of a separate reference per every slab page.
> > 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
> >    page->kmem_cache->memcg indirection instead. It's used only on
> >    slab page release, so it shouldn't be a big issue.
> > 3) Introduce a refcounter for non-root slab caches. It's required to
> >    be able to destroy kmem_caches when they become empty and release
> >    the associated memory cgroup.
> >
> > There is a bonus: currently we do release empty kmem_caches on cgroup
> > removal, however all others are waiting for the releasing of the memory cgroup.
> > These refactorings allow kmem_caches to be released as soon as they
> > become inactive and free.
> >
> > Some additional implementation details are provided in corresponding
> > commit messages.
> >
> > # Results
> >
> > Below is the average number of dying cgroups on two groups of our production
> > hosts. They do run some sort of web frontend workload, the memory pressure
> > is moderate.
> > As we can see, with the kernel memory reparenting the number
> > stabilizes in the 60s range; however with the original version it grows almost
> > linearly and doesn't show any signs of plateauing. The difference in slab
> > and percpu usage between patched and unpatched versions also grows linearly.
> > In 7 days it exceeded 200Mb.
> >
> > day              0    1    2    3    4    5    6    7
> > original        56  362  628  752 1070 1250 1490 1560
> > patched         23   46   51   55   60   57   67   69
> > mem diff(Mb)    22   74  123  152  164  182  214  241
>
> No objection to the idea, but a question...

Hi Greg!

> In patched kernel, does slabinfo (or similar) show the list of reparented
> slab caches? A pile of zombie kmem_caches is certainly better than a
> pile of zombie mem_cgroup. But it still seems like it might cause
> degradation - does cache_reap() walk an ever growing set of zombie
> caches?

It's not a pile of zombie kmem_caches vs a pile of zombie mem_cgroups.
It's a smaller pile of zombie kmem_caches vs a larger pile of zombie
kmem_caches *and* a pile of zombie mem_cgroups.

The patchset makes the number of zombie kmem_caches lower, not bigger.

Re slabinfo and other debug interfaces: I do not change anything here.

>
> We've found it useful to add a slabinfo_full file which includes zombie
> kmem_cache with their memcg_name. This can help hunt down zombies.

I'm not sure we need to add a permanent debug interface, because something
like drgn (https://github.com/osandov/drgn) can be used instead.

If you think that we lack some necessary debug interfaces, I'm totally
open here, but it's not a part of this patchset. Let's talk about them
separately.

Thank you for looking into it!

Roman
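
For readers following the thread, the reparenting idea from the quoted
cover letter can be modelled in a few lines of userspace Python. This is
only a toy sketch under stated assumptions, not the patchset's code: the
names MemCgroup, KmemCache and offline() are hypothetical, and the model
ignores locking, RCU and per-cpu state. It shows a cache holding a single
reference to its cgroup and handing that reference to the parent when the
cgroup goes away.

```python
class MemCgroup:
    """Toy stand-in for a memory cgroup: a name, a parent and a refcount."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.refs = 0        # references pinning this cgroup in memory
        self.online = True

    def can_be_freed(self):
        # A removed cgroup stays around ("dying") while anything references it.
        return not self.online and self.refs == 0


class KmemCache:
    """Toy stand-in for a per-cgroup slab cache (hypothetical model)."""

    def __init__(self, memcg):
        self.memcg = memcg
        self.pages = 0
        memcg.refs += 1      # one reference per cache, not one per slab page

    def alloc_page(self):
        # No per-page reference: the owner is found through cache.memcg.
        self.pages += 1

    def reparent(self):
        # Move the cache's single reference from the cgroup to its parent.
        old, new = self.memcg, self.memcg.parent
        new.refs += 1
        self.memcg = new
        old.refs -= 1


def offline(memcg, caches):
    """Model cgroup removal: reparent every cache the cgroup still owns."""
    memcg.online = False
    for cache in caches:
        if cache.memcg is memcg:
            cache.reparent()


root = MemCgroup("root")
job = MemCgroup("job", parent=root)
cache = KmemCache(job)
for _ in range(3):
    cache.alloc_page()

offline(job, [cache])
print(job.can_be_freed(), cache.memcg.name, cache.pages)  # True root 3
```

In this model the three slab pages stay allocated, but they no longer pin
the removed cgroup; with a per-page reference instead, "job" would remain
a dying cgroup until every page was freed.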