On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote: > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote: > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote: > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote: > > > > This is fairly big but mostly red patch, which makes all non-root > > > > slab allocations use a single set of kmem_caches instead of > > > > creating a separate set for each memory cgroup. > > > > > > > > Because the number of non-root kmem_caches is now capped by the number > > > > of root kmem_caches, there is no need to shrink or destroy them > > > > prematurely. They can be perfectly destroyed together with their > > > > root counterparts. This allows to dramatically simplify the > > > > management of non-root kmem_caches and delete a ton of code. > > > > > > This is definitely going in the right direction. But it doesn't quite > > > explain why we still need two sets of kmem_caches? > > > > > > In the old scheme, we had completely separate per-cgroup caches with > > > separate slab pages. If a cgrouped process wanted to allocate a slab > > > object, we'd go to the root cache and used the cgroup id to look up > > > the right cgroup cache. On slab free we'd use page->slab_cache. > > > > > > Now we have slab pages that have a page->objcg array. Why can't all > > > allocations go through a single set of kmem caches? If an allocation > > > is coming from a cgroup and the slab page the allocator wants to use > > > doesn't have an objcg array yet, we can allocate it on the fly, no? > > > > Well, arguably it can be done, but there are few drawbacks: > > > > 1) On the release path you'll need to make some extra work even for > > root allocations: calculate the offset only to find the NULL objcg pointer. > > > > 2) There will be a memory overhead for root allocations > > (which might or might not be compensated by the increase > > of the slab utilization). > > Those two are only true if there is a wild mix of root and cgroup > allocations inside the same slab, and that doesn't really happen in > practice. Either the machine is dedicated to one workload and cgroups > are only enabled due to e.g. a vendor kernel, or you have cgrouped > systems (like most distro systems now) that cgroup everything. It's actually a questionable statement: we do skip allocations from certain contexts, and we do merge slab caches. Most likely it's true for certain slab_caches and not true for others. Think of kmalloc-* caches. Also, because obj_cgroup vectors will not be freed without underlying pages, most likely the percentage of pages with obj_cgroups will grow with uptime. In other words, memcg allocations will fragment root slab pages. > > > 3) I'm working on percpu memory accounting that resembles the same scheme, > > except that obj_cgroups vector is created for the whole percpu block. > > There will be root- and memcg-blocks, and it will be expensive to merge them. > > I kinda like using the same scheme here and there. > > It's hard to conclude anything based on this information alone. If > it's truly expensive to merge them, then it warrants the additional > complexity. But I don't understand the desire to share a design for > two systems with sufficiently different constraints. > > > Upsides? > > > > 1) slab utilization might increase a little bit (but I doubt it will have > > a huge effect, because both merging sets should be relatively big and well > > utilized) > > Right. > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice, > > but there isn't so much code left anyway. > > There is a lot of complexity associated with the cache cloning that > isn't the lines of code, but the lifetime and synchronization rules. Quite opposite: the patchset removes all the complexity (or 90% of it), because it makes the kmem_cache lifetime independent from any cgroup stuff. Kmem_caches are created on demand on the first request (most likely during the system start-up), and destroyed together with their root counterparts (most likely never or on rmmod). First request means globally first request, not a first request from a given memcg. Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and after creation just matches the lifetime of the root kmem caches. The only reason to keep the async creation is that some kmem_caches are created very early in the boot process, long before any cgroup stuff is initialized. > > And these two things are the primary aspects that make my head hurt > trying to review this patch series. > > > So IMO it's an interesting direction to explore, but not something > > that necessarily has to be done in the context of this patchset. > > I disagree. Instead of replacing the old coherent model and its > complexities with a new coherent one, you are mixing the two. And I > can barely understand the end result. > > Dynamically cloning entire slab caches for the sole purpose of telling > whether the pages have an obj_cgroup array or not is *completely > insane*. If the controller had followed the obj_cgroup design from the > start, nobody would have ever thought about doing it like this. It's just not true. The whole point of having root- and memcg sets is to be able to not look for a NULL pointer in the obj_cgroup vector on releasing of the root object. In other words, it allows to keep zero overhead for root allocations. IMHO it's an important thing, and calling it *completely insane* isn't the best way to communicate. > > From a maintainability POV, we cannot afford merging it in this form. It sounds strange: the patchset eliminates 90% of the complexity, but it's unmergeable because there are 10% left. I agree that it's an arguable question if we can tolerate some additional overhead on root allocations to eliminate these additional 10%, but I really don't think it's so obvious that even discussing it is insane. Btw, there is another good idea to explore (also suggested by Christopher Lameter): we can put memcg/objcg pointer into the slab page, avoiding an extra allocation.