On Wed, Nov 29, 2017 at 5:54 AM, Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
> Hi,
> There is new discoveries!
>
> When I find qlist_move_cache reappear in my environment,
> I use kgdb to break into function qlist_move_cache. I found
> this function is called because of cgroup release.
>
> I also find libvirt allocate a memory cgroup for each qemu it started,
> in my system, it looks like this:
>
> root@ednserver3:/sys/fs/cgroup/memory/machine.slice# ls
> cgroup.clone_children machine-qemu\x2d491_25_30.scope
> machine-qemu\x2d491_40_30.scope machine-qemu\x2d491_6_30.scope
> memory.limit_in_bytes
> cgroup.event_control machine-qemu\x2d491_26_30.scope
> machine-qemu\x2d491_41_30.scope machine-qemu\x2d491_7_30.scope
> memory.max_usage_in_bytes
> cgroup.procs machine-qemu\x2d491_27_30.scope
> machine-qemu\x2d491_4_30.scope machine-qemu\x2d491_8_30.scope
> memory.move_charge_at_immigrate
> machine-qemu\x2d491_10_30.scope machine-qemu\x2d491_28_30.scope
> machine-qemu\x2d491_47_30.scope machine-qemu\x2d491_9_30.scope
> memory.numa_stat
> machine-qemu\x2d491_11_30.scope machine-qemu\x2d491_29_30.scope
> machine-qemu\x2d491_48_30.scope memory.failcnt
> memory.oom_control
> machine-qemu\x2d491_12_30.scope machine-qemu\x2d491_30_30.scope
> machine-qemu\x2d491_49_30.scope memory.force_empty
> memory.pressure_level
> machine-qemu\x2d491_13_30.scope machine-qemu\x2d491_31_30.scope
> machine-qemu\x2d491_50_30.scope memory.kmem.failcnt
> memory.soft_limit_in_bytes
> machine-qemu\x2d491_17_30.scope machine-qemu\x2d491_32_30.scope
> machine-qemu\x2d491_51_30.scope memory.kmem.limit_in_bytes
> memory.stat
> machine-qemu\x2d491_18_30.scope machine-qemu\x2d491_33_30.scope
> machine-qemu\x2d491_52_30.scope memory.kmem.max_usage_in_bytes
> memory.swappiness
> machine-qemu\x2d491_19_30.scope machine-qemu\x2d491_34_30.scope
> machine-qemu\x2d491_5_30.scope memory.kmem.slabinfo
> memory.usage_in_bytes
> machine-qemu\x2d491_20_30.scope machine-qemu\x2d491_35_30.scope
> machine-qemu\x2d491_53_30.scope memory.kmem.tcp.failcnt
> memory.use_hierarchy
> machine-qemu\x2d491_21_30.scope machine-qemu\x2d491_36_30.scope
> machine-qemu\x2d491_54_30.scope memory.kmem.tcp.limit_in_bytes
> notify_on_release
> machine-qemu\x2d491_22_30.scope machine-qemu\x2d491_37_30.scope
> machine-qemu\x2d491_55_30.scope memory.kmem.tcp.max_usage_in_bytes
> tasks
> machine-qemu\x2d491_23_30.scope machine-qemu\x2d491_38_30.scope
> machine-qemu\x2d491_56_30.scope memory.kmem.tcp.usage_in_bytes
> machine-qemu\x2d491_24_30.scope machine-qemu\x2d491_39_30.scope
> machine-qemu\x2d491_57_30.scope memory.kmem.usage_in_bytes
>
> and in each memory cgroup there are many slabs:
> root@ednserver3:/sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d491_10_30.scope# cat memory.kmem.slabinfo
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kmalloc-2048          0    0  2240   3  2 : tunables   24  12  8 : slabdata   0   0  0
> kmalloc-512           0    0   704  11  2 : tunables   54  27  8 : slabdata   0   0  0
> skbuff_head_cache     0    0   384  10  1 : tunables   54  27  8 : slabdata   0   0  0
> kmalloc-1024          0    0  1216   3  1 : tunables   24  12  8 : slabdata   0   0  0
> kmalloc-192           0    0   320  12  1 : tunables  120  60  8 : slabdata   0   0  0
> pid                   3   21   192  21  1 : tunables  120  60  8 : slabdata   1   1  0
> signal_cache          0    0  1216   3  1 : tunables   24  12  8 : slabdata   0   0  0
> sighand_cache         0    0  2304   3  2 : tunables   24  12  8 : slabdata   0   0  0
> fs_cache              0    0   192  21  1 : tunables  120  60  8 : slabdata   0   0  0
> files_cache           0    0   896   4  1 : tunables   54  27  8 : slabdata   0   0  0
> task_delay_info       3   72   112  36  1 : tunables  120  60  8 : slabdata   2   2  0
> task_struct           3    3  3840   1  1 : tunables   24  12  8 : slabdata   3   3  0
> radix_tree_node       0    0   728   5  1 : tunables   54  27  8 : slabdata   0   0  0
> shmem_inode_cache     2    9   848   9  2 : tunables   54  27  8 : slabdata   1   1  0
> inode_cache          39   45   744   5  1 : tunables   54  27  8 : slabdata   9   9  0
> ext4_inode_cache      0    0  1224   3  1 : tunables   24  12  8 : slabdata   0   0  0
> sock_inode_cache      3    8   832   4  1 : tunables   54  27  8 : slabdata   2   2  0
> proc_inode_cache      0    0   816   5  1 : tunables   54  27  8 : slabdata   0   0  0
> dentry               52   90   272  15  1 : tunables  120  60  8 : slabdata   6   6  0
> anon_vma            140  348   136  29  1 : tunables  120  60  8 : slabdata  12  12  0
> anon_vma_chain      257  468   112  36  1 : tunables  120  60  8 : slabdata  13  13  0
> vm_area_struct      510  780   272  15  1 : tunables  120  60  8 : slabdata  52  52  0
> mm_struct             1    3  1280   3  1 : tunables   24  12  8 : slabdata   1   1  0
> cred_jar             12   24   320  12  1 : tunables  120  60  8 : slabdata   2   2  0
>
> So, when I end the libvirt scenery, those slabs belong to those qemus
> has to invoke quarantine_remove_cache,
> I guess that's why qlist_move_cache occupies so much CPU cycles. I
> also guess this make libvirt complain
> (wait for too long?)
>
> Sorry not to research deeply into system in the first place and submit
> a patch in a hurry.
>
> And I propose a little suggestion to improve qlist_move_cache if you
> like. Won't we design some kind of hash mechanism,
> then we group the qlist_node according to their cache, so as not to
> compare one by one to every qlist_node in the system.

Yes, quarantine_remove_cache() is very slow because it walks a huge
linked list, and synchronize_srcu() does not help either. It would be
great to make it faster rather than papering over the problem with
rescheds.

Please detail your scheme. Note that quarantine needs to be
[best-effort] global FIFO and that the main operations are actually
kmalloc/kfree, so we should not penalize them either. We also have
limited memory in memory blocks.
I had some ideas but I couldn't come up with a complete solution that
I would like.

One thing is that we could first check if the cache actually has _any_
outstanding objects. Looking at your slabinfo dump, it seems that lots
of them don't have active objects. In that case we can skip all of
quarantine_remove_cache entirely.
I see there is already a function for this:

static int shutdown_cache(struct kmem_cache *s)
{
	/* free asan quarantined objects */
	kasan_cache_shutdown(s);

	if (__kmem_cache_shutdown(s) != 0)
		return -EBUSY;

So maybe we could do just:

static int shutdown_cache(struct kmem_cache *s)
{
	if (__kmem_cache_shutdown(s) != 0) {
		/* free asan quarantined objects */
		kasan_cache_shutdown(s);
		if (__kmem_cache_shutdown(s) != 0)
			return -EBUSY;
	}

We could also make cache freeing asynchronous. Then we could either
just wait until the cache no longer has any active objects (walk and
check all deferred caches after each quarantine_reduce()), or
accumulate a batch of them and then walk quarantine once and remove
objects for the whole batch of caches (this would amortize the
overhead by the batch size). As far as I understand, in lots of cases
caches are freed in large batches (cgroups, namespaces), and that's
exactly when quarantine_remove_cache() performance is a problem.

Or we could make quarantine a doubly-linked list and then walk all
active objects in the cache (is it possible?) and remove them from
quarantine by shuffling next/prev pointers. However, this can increase
memory consumption and penalize performance of other operations.
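
To make the batching option a bit more concrete, here is a rough
userspace sketch, not kernel code. It assumes each quarantined node
carries a direct pointer to its owning cache, which is a simplification;
struct cache, struct qnode, struct qlist, qlist_put(), in_batch() and
qlist_move_batch() are all hypothetical names made up for this
illustration. The point is only that one pass over the list can serve a
whole batch of caches while the remaining nodes keep their FIFO order.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct cache {
	int id;
};

struct qnode {
	struct qnode *next;
	struct cache *cache;	/* owning cache (assumed stored in the node) */
};

struct qlist {
	struct qnode *head;
	struct qnode *tail;
};

/* Append a node at the tail, so insertion order (FIFO) is preserved. */
static void qlist_put(struct qlist *q, struct qnode *n)
{
	n->next = NULL;
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
}

/* Linear membership test; fine for small batches. */
static bool in_batch(const struct cache *c, struct cache *const *batch,
		     size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (batch[i] == c)
			return true;
	return false;
}

/*
 * Walk 'from' exactly once. Nodes owned by any cache in 'batch' are
 * appended to 'to'; all other nodes are re-appended to 'from' in their
 * original order. Returns the number of nodes moved.
 */
static size_t qlist_move_batch(struct qlist *from, struct qlist *to,
			       struct cache *const *batch, size_t n)
{
	struct qnode *curr = from->head;
	size_t moved = 0;

	from->head = NULL;
	from->tail = NULL;
	while (curr) {
		struct qnode *next = curr->next;

		if (in_batch(curr->cache, batch, n)) {
			qlist_put(to, curr);
			moved++;
		} else {
			qlist_put(from, curr);
		}
		curr = next;
	}
	return moved;
}
```

With a batch of k caches this does one list walk instead of k, at the
cost of a k-way comparison per node. For large batches the linear
in_batch() check could be replaced by a per-cache "being destroyed"
flag or a small hash set, but that is a further assumption and not
something measured here.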