On Wed, Jun 15, 2022 at 12:16:10PM +0800, yueluck wrote:
> add a detailed attachment
>
> At 2022-06-15 12:14:23, "yueluck" <yueluck@xxxxxxx> wrote:
>
> Hi, both of you:
> Sorry to trouble you, because rcu is too complicated.
> I encounter many hung processes which are normal container-runc, the number of which increases continually and system load becomes higher and os reboots.
> There is a related link https://access.redhat.com/solutions/5224631 ,

I do not have access to this document, so I cannot say anything about
their offered solution. They do claim to have a solution, though, so
I strongly suggest you follow their suggestions.

Me, I work with mainline, and the 4.18 kernel that you are running was
released almost four years ago.

> the call stack and scene are similar. patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1d1f898df6586c5ea9aeaf349f13089c6fa37903

What happens when you apply this patch?

> process is never woken up after synchronize_rcu.
> Could you please have a look at the call stack(attachment) and give me some idea?
> source code : https://github.com/bigclouds/linux-4.18.0-147.3.1.el8

Do you see RCU CPU stall warnings? Please see the Linux-kernel file
named Documentation/RCU/stallwarn.* for more information. (The "*"
might be "txt" or "rst" depending on how old your kernel source tree is.)
In particular, this file describes various things that can prevent
synchronize_rcu() from returning, ranging from CPUs spinning with
interrupts disabled to malfunctioning timer hardware.

If you do not see stall warnings, have they been disabled? The values
of the RCU_CPU_STALL_TIMEOUT Kconfig option and the kernel boot parameter
rcupdate.rcu_cpu_stall_suppress control this, as does the
rcupdate.rcu_cpu_stall_suppress_at_boot kernel parameter.

So if the RCU CPU stall warnings have been disabled, please re-enable
them. They give much more information on these sorts of problems.

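As a rough sketch of how one might check these settings on a running
system, something like the shell functions below could be a starting
point. The function names are made up for illustration. The directory
is where the rcupdate module parameters normally show up in sysfs, but
which files exist depends on the kernel version, and
rcu_cpu_stall_suppress_at_boot in particular may be absent on older
kernels. The grep pattern is only a guess at what stall messages look
like on a given kernel, so please adjust as needed.

```shell
# Sketch only: helper names are hypothetical, not part of any kernel tooling.

# Print the RCU CPU stall-warning parameters found in a directory.
# $1: parameter directory (assumed default: the usual sysfs location).
check_rcu_stall_params() {
    dir="${1:-/sys/module/rcupdate/parameters}"
    for p in rcu_cpu_stall_suppress rcu_cpu_stall_suppress_at_boot \
             rcu_cpu_stall_timeout; do
        if [ -r "$dir/$p" ]; then
            # A nonzero suppress value means stall warnings are disabled.
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            # Older kernels may not provide every parameter.
            printf '%s=(not present)\n' "$p"
        fi
    done
}

# Filter kernel log text on stdin down to lines that look like
# RCU CPU stall warnings (pattern is an assumption, not exhaustive).
rcu_stall_lines() {
    grep -i 'rcu.*stall'
}
```

After sourcing this, running check_rcu_stall_params with no arguments
reads the live settings, and "dmesg | rcu_stall_lines" shows any stall
warnings still in the kernel log buffer.
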
Plus there is the usual debugging advice, for example, if this is a new
problem, look at what has changed at about the time that the problem
appeared. For example, things like this can happen when backporting
fixes or when bringing up new hardware.

Also, please apply whatever debugging tools you have to check the health
of the CPUs, for example, to see if any are spinning with preemption
or interrupts disabled. Or even if any are in a tight loop in the kernel.
(No, this will not be visible from the stack trace of the task blocked
in synchronize_rcu().)

And again, please read Documentation/RCU/stallwarn.* carefully, preferably
getting the version from a recent kernel such as v5.18. This document
contains lots of information on causes of this sort of problem.

Thanx, Paul

> Thanks,
>
> ------env-----------------------
> centos 4.18.0-147.3.1.el8_1.3
> -------ps------------------------
> $ ps -aux| grep 156623
> root 156623 0.0 0.0 24012 9044 ? D May31 0:00 runc init
> ------stack----------------------
> sudo cat /proc/156623/stack
> Password:
> [<0>] __wait_rcu_gp+0x117/0x140
> [<0>] synchronize_rcu+0x6f/0x80
> [<0>] namespace_unlock+0x67/0x80
> [<0>] ksys_umount+0x231/0x450
> [<0>] __x64_sys_umount+0x12/0x20
> [<0>] do_syscall_64+0x5b/0x1c0
> [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
> [<0>] 0xffffffffffffffff
> test:/var/log$ sudo cat /proc/156623/stat
> ----------------------------------