Hi!
After applying the below patch, the 5 most problematic servers have run
without any issues for 23 hours. That never happened before the patch on
5.14, so the patch seems to have fixed the issue for me.
On Monday there will be more load on the servers, which caused them to
crash faster without the patch. I will let you know if it happens again.
Best regards,
Rune
On 16/10/2021 00:10, Eric W. Biederman wrote:
In commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and the struct user_struct interact. Unfortunately that
optimization does not work in the obvious way when moved to nested
rlimits. The problem is that the last decrement of the per-user,
per-user-namespace sigpending counter might also be the last decrement
of the sigpending counter in the parent user namespace, which means
that simply freeing the leaf ucount in __free_sigqueue is not enough.
Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.
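
For illustration only, here is a minimal user-space sketch of the idea
described above: charge every level of the hierarchy in one function,
and uncharge every level in another, so that the reference of each level
whose count returns to zero is dropped. This is not the kernel code. The
struct ucount_model type, its fields, and the model_* function names are
hypothetical, and the sketch uses plain integers where the kernel uses
atomics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of one ucount level.
 * This is NOT the kernel's struct ucounts. */
struct ucount_model {
    struct ucount_model *parent; /* level in the parent user namespace, NULL at the root */
    long sigpending;             /* signals currently charged at this level */
    long max;                    /* sigpending limit at this level */
    int refs;                    /* references held on this level */
};

/* Uncharge one signal from every level in [first, stop), dropping the
 * reference of each level whose count returns to zero. */
static void model_uncharge(struct ucount_model *first, struct ucount_model *stop)
{
    for (struct ucount_model *it = first; it != stop; it = it->parent) {
        assert(it->sigpending > 0);
        if (--it->sigpending == 0)
            it->refs--;
    }
}

/* Charge one signal at every level, leaf to root. A level whose count
 * goes from 0 to 1 gains a reference. Returns the leaf's new count, or
 * 0 if some level would exceed its limit (all charges are then undone). */
long model_inc_rlimit_get(struct ucount_model *leaf)
{
    long ret = 0;

    for (struct ucount_model *it = leaf; it != NULL; it = it->parent) {
        long new_count = ++it->sigpending;

        if (new_count > it->max) {
            it->sigpending--;         /* undo the failing level; it took no reference */
            model_uncharge(leaf, it); /* undo every level below it */
            return 0;
        }
        if (it == leaf)
            ret = new_count;
        if (new_count == 1)
            it->refs++;
    }
    return ret;
}

/* Uncharge one signal at every level, leaf to root. */
void model_dec_rlimit_put(struct ucount_model *leaf)
{
    model_uncharge(leaf, NULL);
}
```

In this model, the last model_dec_rlimit_put() for a user drops the
reference on the leaf and on any parent level whose count also reaches
zero, which is the case that freeing only the leaf would miss.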
I wish we had a single user across all of the threads whose rlimit
could be charged so we did not need this complexity.