On Tue, Feb 04, 2020 at 01:41:59PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > > This is a fairly big but mostly red patch, which makes all non-root slab allocations use a single set of kmem_caches instead of creating a separate set for each memory cgroup.
> > > > > > > >
> > > > > > > > Because the number of non-root kmem_caches is now capped by the number of root kmem_caches, there is no need to shrink or destroy them prematurely. They can be destroyed together with their root counterparts. This allows us to dramatically simplify the management of non-root kmem_caches and delete a ton of code.
> > > > > > >
> > > > > > > This is definitely going in the right direction. But it doesn't quite explain why we still need two sets of kmem_caches.
> > > > > > >
> > > > > > > In the old scheme, we had completely separate per-cgroup caches with separate slab pages. If a cgrouped process wanted to allocate a slab object, we'd go to the root cache and use the cgroup id to look up the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > >
> > > > > > > Now we have slab pages that have a page->objcg array. Why can't all allocations go through a single set of kmem caches? If an allocation is coming from a cgroup and the slab page the allocator wants to use doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > >
> > > > > > Well, arguably it can be done, but there are a few drawbacks:
> > > > > >
> > > > > > 1) On the release path you'll need to do some extra work even for root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > >
> > > > > > 2) There will be a memory overhead for root allocations (which might or might not be compensated by the increase in slab utilization).
> > > > >
> > > > > Those two are only true if there is a wild mix of root and cgroup allocations inside the same slab, and that doesn't really happen in practice. Either the machine is dedicated to one workload and cgroups are only enabled due to e.g. a vendor kernel, or you have cgrouped systems (like most distro systems now) that cgroup everything.
> > > >
> > > > It's actually a questionable statement: we do skip allocations from certain contexts, and we do merge slab caches.
> > > >
> > > > Most likely it's true for certain slab caches and not true for others. Think of the kmalloc-* caches.
> > >
> > > With merging it's actually really hard to say how sparse or dense the resulting objcgroup arrays would be. It could change all the time too.
> >
> > So here is some actual data from my dev machine. The first column is the number of pages in the root cache, the second is the number in the corresponding memcg cache.
> >
> > ext4_groupinfo_4k 1 0
> > rpc_inode_cache 1 0
> > fuse_request 62 0
> > fuse_inode 1 2732
> > btrfs_delayed_node 1192 0
> > btrfs_ordered_extent 129 0
> > btrfs_extent_map 8686 0
> > btrfs_extent_buffer 2648 0
> > btrfs_inode 12 6739
> > PINGv6 1 11
> > RAWv6 2 5
> > UDPv6 1 34
> > tw_sock_TCPv6 378 3
> > request_sock_TCPv6 24 0
> > TCPv6 46 74
> > mqueue_inode_cache 1 0
> > jbd2_journal_handle 2 0
> > jbd2_journal_head 2 0
> > jbd2_revoke_table_s 1 0
> > ext4_inode_cache 1 3
> > ext4_allocation_context 1 0
> > ext4_io_end 1 0
> > ext4_extent_status 5 0
> > mbcache 1 0
> > dnotify_struct 1 0
> > posix_timers_cache 24 0
> > xfrm_dst_cache 202 0
> > RAW 3 12
> > UDP 2 24
> > tw_sock_TCP 25 0
> > request_sock_TCP 24 0
> > TCP 7 24
> > hugetlbfs_inode_cache 2 0
> > dquot 2 0
> > eventpoll_pwq 1 119
> > dax_cache 1 0
> > request_queue 9 0
> > blkdev_ioc 241 0
> > biovec-max 112 0
> > biovec-128 2 0
> > biovec-64 6 0
> > khugepaged_mm_slot 248 0
> > dmaengine-unmap-256 1 0
> > dmaengine-unmap-128 1 0
> > dmaengine-unmap-16 39 0
> > sock_inode_cache 9 219
> > skbuff_ext_cache 249 0
> > skbuff_fclone_cache 83 0
> > skbuff_head_cache 138 141
> > file_lock_cache 24 0
> > net_namespace 1 5
> > shmem_inode_cache 14 56
> > task_delay_info 23 165
> > taskstats 24 0
> > proc_dir_entry 24 0
> > pde_opener 16 24
> > proc_inode_cache 24 1103
> > bdev_cache 4 20
> > kernfs_node_cache 1405 0
> > mnt_cache 54 0
> > filp 53 460
> > inode_cache 488 2287
> > dentry 367 10576
> > names_cache 24 0
> > ebitmap_node 2 0
> > avc_xperms_data 256 0
> > lsm_file_cache 92 0
> > buffer_head 24 9
> > uts_namespace 1 3
> > vm_area_struct 48 810
> > mm_struct 19 29
> > files_cache 14 26
> > signal_cache 28 143
> > sighand_cache 45 47
> > task_struct 77 430
> > cred_jar 29 424
> > anon_vma_chain 39 492
> > anon_vma 28 467
> > pid 30 369
> > Acpi-Operand 56 0
> > Acpi-Parse 5587 0
> > Acpi-State 4137 0
> > Acpi-Namespace 8 0
> > numa_policy 137 0
> > ftrace_event_field 68 0
> > pool_workqueue 25 0
> > radix_tree_node 1694 7776
> > task_group 21 0
> > vmap_area 477 0
> > kmalloc-rcl-512 473 0
> > kmalloc-rcl-256 605 0
> > kmalloc-rcl-192 43 16
> > kmalloc-rcl-128 1 47
> > kmalloc-rcl-96 3 229
> > kmalloc-rcl-64 6 611
> > kmalloc-8k 48 24
> > kmalloc-4k 372 59
> > kmalloc-2k 132 50
> > kmalloc-1k 251 82
> > kmalloc-512 360 150
> > kmalloc-256 237 0
> > kmalloc-192 298 24
> > kmalloc-128 203 24
> > kmalloc-96 112 24
> > kmalloc-64 796 24
> > kmalloc-32 1188 26
> > kmalloc-16 555 25
> > kmalloc-8 42 24
> > kmem_cache_node 20 0
> > kmem_cache 24 0
>
> That's interesting, thanks. It does look fairly bimodal, except in some smaller caches. Which does make sense when you think about it: we focus on accounting consumers that are driven by userspace activity and big enough to actually matter in terms of cgroup footprint.
>
> > > > Also, because obj_cgroup vectors will not be freed without underlying pages, most likely the percentage of pages with obj_cgroups will grow with uptime. In other words, memcg allocations will fragment root slab pages.
> > >
> > > I understand the first part of this paragraph, but not the second. The objcgroup vectors will be freed when the slab pages get freed. But the partially filled slab pages can be reused by any types of allocations, surely? How would this cause the pages to fragment?
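(For illustration, here is a minimal toy model, in plain userspace C with made-up names, of the single-set scheme being debated above: a per-page vector of obj_cgroup pointers that is allocated on the first accounted allocation, consulted on every free, and released only together with the page. This is a sketch of the idea, not the patchset's actual code.)

/* Toy model of a slab page that can carry an obj_cgroup vector. */
#include <stdlib.h>

struct obj_cgroup;                      /* opaque here */

struct toy_slab_page {
        unsigned int nr_objects;        /* objects per slab page */
        struct obj_cgroup **objcgs;     /* NULL until first accounted alloc */
};

/* Accounted allocation: allocate the vector on the fly, then fill a slot. */
int toy_charge_object(struct toy_slab_page *page, unsigned int idx,
                      struct obj_cgroup *objcg)
{
        if (!page->objcgs) {
                page->objcgs = calloc(page->nr_objects, sizeof(*page->objcgs));
                if (!page->objcgs)
                        return -1;
        }
        page->objcgs[idx] = objcg;
        return 0;
}

/*
 * Free path: with a single cache set, even a root (unaccounted) object
 * has to compute its slot and look into the vector, only to find a NULL
 * pointer -- drawback 1) above.
 */
struct obj_cgroup *toy_uncharge_object(struct toy_slab_page *page,
                                       unsigned int idx)
{
        struct obj_cgroup *objcg = NULL;

        if (page->objcgs) {
                objcg = page->objcgs[idx];
                page->objcgs[idx] = NULL;
        }
        return objcg;                   /* NULL for root objects */
}

/* The vector goes away only when the slab page itself is released. */
void toy_release_page(struct toy_slab_page *page)
{
        free(page->objcgs);
        page->objcgs = NULL;
}

Both costs under discussion fall out of this directly: toy_uncharge_object() runs for root objects too, and the vector allocated in toy_charge_object() stays around even if every later allocation on the page is a root one.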
> > I mean the following: once you allocate a single accounted object from the page, the obj_cgroup vector is allocated and will be released only with the slab page. We really, really don't want to count how many accounted objects are on the page and release the obj_cgroup vector on reaching 0. So even if all following allocations are root allocations, the overhead will not go away with uptime.
> >
> > In other words, even a small percentage of accounted objects will turn the whole cache into an "accountable" one.
>
> Correct. The worst case is where we have a large cache that has N objects per slab, but only ~1/N objects are accounted to a cgroup.
>
> The question is whether this is common or even realistic. When would a cache be big, but only a small subset of its allocations be attributable to specific cgroups?
>
> On the less extreme overlapping cases, yeah, there are fragmented obj_cgroup arrays, but there is also better slab packing. One is an array of pointers, the other is an array of larger objects. It would seem slab fragmentation has the potential to waste much more memory?
>
> > > > > > 3) I'm working on percpu memory accounting that follows the same scheme, except that the obj_cgroups vector is created for the whole percpu block. There will be root- and memcg-blocks, and it will be expensive to merge them. I kinda like using the same scheme here and there.
> > > > >
> > > > > It's hard to conclude anything based on this information alone. If it's truly expensive to merge them, then it warrants the additional complexity. But I don't understand the desire to share a design for two systems with sufficiently different constraints.
> > > > >
> > > > > > Upsides?
> > > > > >
> > > > > > 1) slab utilization might increase a little bit (but I doubt it will have a huge effect, because both merged sets should be relatively big and well utilized)
> > > > >
> > > > > Right.
> > > > >
> > > > > > 2) eliminate dynamic creation/destruction of memcg kmem_caches. It's nice, but there isn't that much code left anyway.
> > > > >
> > > > > There is a lot of complexity associated with the cache cloning that isn't the lines of code, but the lifetime and synchronization rules.
> > > >
> > > > Quite the opposite: the patchset removes all the complexity (or 90% of it), because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > >
> > > > Kmem_caches are created on demand at the first request (most likely during system start-up), and destroyed together with their root counterparts (most likely never, or on rmmod). First request means the globally first request, not the first request from a given memcg.
> > > >
> > > > The memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and after creation it just matches the lifetime of the root kmem caches.
> > > >
> > > > The only reason to keep the async creation is that some kmem_caches are created very early in the boot process, long before any cgroup stuff is initialized.
> > >
> > > Yes, it's independent of the obj_cgroup and memcg, and yes, it's simpler after your patches. But I'm not talking about the delta, I'm trying to understand the end result.
> > >
> > > And the truth is there is a decent chunk of code and tentacles spread throughout the slab/cgroup code to clone, destroy, and handle the split caches, as well as the branches/indirections on every cgrouped slab allocation.
> >
> > Did you see the final code? It's fairly simple and there is really not much complexity left.
> > If you don't think so, let's go into details, because otherwise it's hard to say anything.
>
> I have the patches applied to a local tree and am looking at the final code. But I can only repeat that "it's not too bad" simply isn't a good explanation for why the code is the way it is.
>
> > With such a change, which basically removes the current implementation and replaces it with a new one, it's hard to keep the balance between making commits self-contained and small and also showing the whole picture. I'm fully open to questions and generally want to make it simpler.
> >
> > I've tried to separate some parts and get them merged before the main thing, but they haven't been merged yet, so I have to include them to keep the thing building.
> >
> > Would a more detailed design description in the cover letter help? Would a design doc in Documentation/ help? Is it better to rearrange the patches so that the current implementation is eliminated first and the new one is built from scratch?
>
> It would help to have changelogs that actually describe how the new design is supposed to work, and why you made the decisions you made.
>
> The changelog in this patch here sells the change as a reduction in complexity, without explaining why it stopped where it stopped. So naturally, if that's the declared goal, the first question is whether we can make it simpler.
>
> Both the cover letter and the changelogs should focus less on what was there and how it was deleted, and more on how the code is supposed to work after the patches: how the components were designed and how they all work together.
>
> As I said before, imagine somebody without any historical knowledge reading the code. They should be able to find out why you chose to have two sets of kmem caches. There is no explanation for it other than "there used to be more, so we cut it down to two".
>
> > > > > And these two things are the primary aspects that make my head hurt trying to review this patch series.
> > > > >
> > > > > > So IMO it's an interesting direction to explore, but not something that necessarily has to be done in the context of this patchset.
> > > > >
> > > > > I disagree. Instead of replacing the old coherent model and its complexities with a new coherent one, you are mixing the two. And I can barely understand the end result.
> > > > >
> > > > > Dynamically cloning entire slab caches for the sole purpose of telling whether the pages have an obj_cgroup array or not is *completely insane*. If the controller had followed the obj_cgroup design from the start, nobody would have ever thought about doing it like this.
> > > >
> > > > It's just not true. The whole point of having root and memcg sets is to avoid looking for a NULL pointer in the obj_cgroup vector when releasing a root object. In other words, it allows us to keep zero overhead for root allocations. IMHO that's an important thing, and calling it *completely insane* isn't the best way to communicate.
> > >
> > > But you're trading it for the indirection of going through a separate kmem_cache for every single cgroup-accounted allocation. Why is this a preferable trade-off to make?
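(For scale, a back-of-envelope estimate of the memory side of this trade-off, using illustrative geometry only: 4K slab pages and 8-byte pointers; real caches may use higher-order pages.)

  kmalloc-512: 4096 / 512 = 8 objects per page  -> vector = 8 * 8B  = 64B  (~1.6% of the page)
  kmalloc-64:  4096 / 64  = 64 objects per page -> vector = 64 * 8B = 512B (~12.5% of the page)

With a single cache set, a page pays this cost as soon as one object on it is accounted, even if every other object on it is a root allocation; with split sets, root-only pages never carry a vector.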
> > Because it allows us to keep zero memory and cpu overhead for root allocations. I have no data showing that this overhead is small and acceptable in all cases. I think keeping zero overhead for root allocations is more important than having a single set of kmem caches.
>
> In the kmem cache breakdown you provided above, there are 35887 pages allocated to root caches and 37300 pages allocated to cgroup caches.
>
> Why are root allocations supposed to be more important? Aren't some of the hottest allocations tracked by cgroups? Look at fork():
>
> > vm_area_struct 48 810
> > mm_struct 19 29
> > files_cache 14 26
> > signal_cache 28 143
> > sighand_cache 45 47
> > task_struct 77 430
> > cred_jar 29 424
> > anon_vma_chain 39 492
> > anon_vma 28 467
> > pid 30 369
>
> Hard to think of much hotter allocations. They all have to suffer the additional branch and cache footprint of the auxiliary cgroup caches.
>
> > > > I agree that it's an arguable question whether we can tolerate some additional overhead on root allocations to eliminate these additional 10%, but I really don't think it's so obvious that even discussing it is insane.
> > >
> > > Well, that's exactly my point.
> >
> > Ok, what's the acceptable performance penalty? Is adding 20% on the free path acceptable, for example? Or adding 3% of slab memory?
>
> I find offhand replies like these very jarring.
>
> There is a legitimate design question: why are you using a separate set of kmem caches for the cgroup allocations, citing the additional complexity over having one set? And your reply was mostly handwaving.

Johannes, I posted patches and numbers that show that the patchset improves a fundamental kernel characteristic (slab utilization) by a meaningful margin. This has been confirmed by others, who kindly tested it on their machines.

Surprisingly, during this and previous review sessions I didn't hear a single good word from you, only a constant stream of blame: I don't answer questions, I don't write perfect code, I fail to provide satisfying answers, I'm waving hands, saying insane things, and so on. At any minimal disagreement you basically raise the tone.

I find this style of discussion irritating and non-productive. So I'm taking a break and starting work on the next version.