On 29/11/2024 13:12, Greg Kroah-Hartman wrote:
> On Fri, Nov 29, 2024 at 12:32:36PM +0100, Jeremi Piotrowski wrote:
>> From: Minchan Kim <minchan@xxxxxxxxxx>
>>
>> [ Upstream commit 393c3714081a53795bbff0e985d24146def6f57f ]
>>
>> The kernfs implementation has big lock granularity(kernfs_rwsem) so
>> every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the
>> lock. It makes trouble for some cases to wait the global lock
>> for a long time even though they are totally independent contexts
>> each other.
>>
>> A general example is process A goes under direct reclaim with holding
>> the lock when it accessed the file in sysfs and process B is waiting
>> the lock with exclusive mode and then process C is waiting the lock
>> until process B could finish the job after it gets the lock from
>> process A.
>>
>> This patch switches the global kernfs_rwsem to per-fs lock, which
>> put the rwsem into kernfs_root.
>>
>> Suggested-by: Tejun Heo <tj@xxxxxxxxxx>
>> Acked-by: Tejun Heo <tj@xxxxxxxxxx>
>> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
>> Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@xxxxxxxxxx
>> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx>
>> ---
>> Hi Stable Maintainers,
>>
>> This upstream commit fixes a kernel hang due to severe lock contention on
>> kernfs_rwsem that occurs when container workloads perform a lot of cgroupfs
>> accesses. Could you please apply to 5.15.y? I cherry-pick the upstream commit
>> to v5.15.173 and then performed `git format-patch`.
>
> This should not hang, but rather just reduce contention, right? Do you
> have real performance numbers that show this is needed? What workloads
> are overloading cgroupfs?

A system hang due to the contention might be a more accurate description. On a
kubernetes node there is always a stream of processes (systemd, kubelet,
containerd, cadvisor) periodically opening/stating/reading cgroupfs files, and
Java apps also love reading cgroup files. Other operations, such as the
creation of short-lived containers, take a write lock on the rwsem when
creating cgroups and when creating veth netdevs; veth netdev creation takes the
rwsem when it creates sysfs files. Systemd service startup also contends for
the same write lock.

It's not so much a particular workload as it is a matter of scale: cgroupfs
read accesses scale with the number of containers on a host. With enough
readers and the right mix of writers, write operations can take minutes.

Here are some real performance numbers: I have a representative reproducer
with 50 cgroupfs readers in a loop and a container batch job every minute.
`systemctl status` times out after 1m30s, and container creation takes over
4m, causing the operations to pile up and making the situation even worse.

With this patch included, under the same load, the operations finish in ~10s,
preventing the system from becoming unresponsive. The patch stops sysfs and
cgroupfs modifications from contending for the same rwsem, and it also lowers
contention between different cgroup subsystems.
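To make that concrete, the shape of the upstream change is roughly the
following (an illustrative sketch, not the literal diff; kernfs_root() is the
existing kernfs helper that maps a node to its root, and modify_node() is just
a placeholder for the various real lock sites):

/* Before: one rwsem shared by every kernfs-based filesystem (sysfs,
 * cgroupfs, ...), so unrelated mounts serialize on the same lock. */
DECLARE_RWSEM(kernfs_rwsem);

/* After: the rwsem lives in the kernfs_root, so the sysfs root and each
 * cgroup hierarchy root get their own lock. */
struct kernfs_root {
	/* ... existing fields ... */
	struct rw_semaphore	kernfs_rwsem;
};

/* Lock sites then take the rwsem of the node's own root instead of the
 * global one: */
static void modify_node(struct kernfs_node *kn)
{
	struct kernfs_root *root = kernfs_root(kn);

	down_write(&root->kernfs_rwsem);
	/* ... add/remove/rename nodes under this root ... */
	up_write(&root->kernfs_rwsem);
}

The reader side of the reproducer mentioned above is essentially 50 processes
running the loop below (a sketch under assumptions -- it is not the exact tool,
and the default path is only a stand-in for whatever cgroupfs files kubelet,
cadvisor and friends poll on a given node):

/* cgroup-readers.c: spawn NREADERS processes that open/read/close a
 * cgroupfs file in a tight loop.  Build with:
 *   gcc -O2 -o cgroup-readers cgroup-readers.c */
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

#define NREADERS 50

static void reader_loop(const char *path)
{
	char buf[4096];

	for (;;) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			continue;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
}

int main(int argc, char **argv)
{
	/* Default assumes a cgroup v2 mount; pass a path for v1 setups. */
	const char *path = argc > 1 ? argv[1] : "/sys/fs/cgroup/cgroup.stat";
	int i;

	for (i = 0; i < NREADERS; i++) {
		if (fork() == 0) {
			reader_loop(path);
			_exit(0);
		}
	}
	/* Meanwhile run the container batch job / `systemctl status` and
	 * time how long the write-side operations take. */
	for (i = 0; i < NREADERS; i++)
		wait(NULL);
	return 0;
}
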
> And why not just switch them to 6.1.y kernels or newer?

I wish we could just do that. Right now all our users are on 5.15, and many of
their workloads are sensitive to changes to any part of the container stack,
including the kernel version. So they will gradually migrate to 6.1.y and newer
kernels as part of upgrading their clusters to a new kubernetes release, after
they have validated their workloads on it. This is a slow process, and in the
meantime they are hitting the issue that this patch addresses. I'm sure there
are other similar users of 5.15 out there.

> thanks,
>
> greg k-h

Thanks,
Jeremi