From: SeongJae Park <sjpark@xxxxxxxxx>

On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
calls make the number of active slab objects, including those of the
'sock_inode_cache' type, increase rapidly and continuously.  As a result,
memory pressure occurs.

In more detail, I made an artificial reproducer that resembles the
workload in which we found the problem and reproduces the problem faster.
It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop and
takes about 2 minutes.  On a machine with 40 CPU cores and 70GB of DRAM,
the available memory continuously dropped at a fast rate (about 120MB
per second, 15GB in total within the 2 minutes).  Note that the issue
does not reproduce on every machine.  On my 6 CPU core machine, the
problem did not reproduce.

'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate
the relevant memory objects.  They are asynchronously invoked from work
queues and internally use 'rcu_barrier()' to ensure safe destruction.
'cleanup_net()' works in a batched manner in a single-threaded worker,
while 'fqdir_work_fn()' runs once per 'fqdir_exit()' call on the
'system_wq'.  Therefore, 'fqdir_work_fn()' was called frequently under
the workload and made the contention on 'rcu_barrier()' high.  In more
detail, the global mutex, 'rcu_state.barrier_mutex', became the
bottleneck.

I made the 'rcu_barrier()' call and the subsequent lightweight work in
'fqdir_work_fn()' be processed in batch by a dedicated single-threaded
worker, and confirmed that it works.  After the change, no continuous
memory reduction was observed, only some fluctuation; the available
memory reduction was bounded to about 400MB.  The following patch
implements the change.

I think this is the right point fix for this issue, but someone might
blame different parts instead.

1. User: Frequent 'unshare()' calls

From some point of view, such frequent 'unshare()' calls might simply
seem insane.

2. Global mutex in 'rcu_barrier()'

Because of the global mutex, 'rcu_barrier()' callers could wait a long
time even after the callbacks that started before the call have
finished.  Therefore, similar issues could happen with other
'rcu_barrier()' usages.  Maybe we can use some wait-queue-like mechanism
to notify the waiters when the desired time has come.

I personally believe it makes sense to apply the point fix for now and
improve 'rcu_barrier()' in the long term.  If I'm missing something or
you have a different opinion, please feel free to let me know.

Patch History
-------------

Changes from v2
(https://lore.kernel.org/lkml/20201210080844.23741-1-sjpark@xxxxxxxxxx/)
- Add numbers after the patch (Eric Dumazet)
- Make only 'rcu_barrier()' and subsequent lightweight works serialized
  (Eric Dumazet)

Changes from v1
(https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@xxxxxxxxxx/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)

SeongJae Park (1):
  net/ipv4/inet_fragment: Batch fqdir destroy works

 include/net/inet_frag.h  |  1 +
 net/ipv4/inet_fragment.c | 45 +++++++++++++++++++++++++++++++++-------
 2 files changed, 39 insertions(+), 7 deletions(-)

-- 
2.17.1
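
Appendix: illustrative code sketches
------------------------------------

For reference, below is a minimal sketch of the kind of reproducer
described above.  It is not the exact program I used, only an
illustration of the idea, and it needs CAP_SYS_ADMIN (e.g., run as
root) for 'unshare(CLONE_NEWNET)' to succeed.

/* Repeat 'unshare(CLONE_NEWNET)' 50,000 times in a loop. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int i;

	for (i = 0; i < 50000; i++) {
		/* Each call creates (and later destroys) a netns. */
		if (unshare(CLONE_NEWNET)) {
			perror("unshare");
			return EXIT_FAILURE;
		}
	}
	return EXIT_SUCCESS;
}

The batching approach can be illustrated with the generic pattern below.
This is not the actual patch (see the inet_fragment.c change in this
series); the 'foo_*' names and the 'free_list' field are made up for
illustration, and the real change uses a dedicated worker and handles
the fqdir-specific details.

#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical object that is freed via a batched, RCU-safe path. */
struct foo {
	struct llist_node free_list;
	/* ... */
};

static LLIST_HEAD(foo_free_list);

static void foo_free_batch_fn(struct work_struct *work)
{
	struct llist_node *batch = llist_del_all(&foo_free_list);
	struct foo *f, *tmp;

	/* Pay the 'rcu_barrier()' latency once for the whole batch
	 * instead of once per object. */
	rcu_barrier();

	llist_for_each_entry_safe(f, tmp, batch, free_list)
		kfree(f);
}

static DECLARE_WORK(foo_free_work, foo_free_batch_fn);

/* Called from each destroy path instead of scheduling one work per
 * object. */
static void foo_schedule_free(struct foo *f)
{
	/* llist_add() returns true only when the list was empty, so the
	 * work item is queued once per batch. */
	if (llist_add(&f->free_list, &foo_free_list))
		queue_work(system_wq, &foo_free_work);
}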