On Wed, Oct 18, 2023 at 10:47 AM Chen Hu <hu1.chen@xxxxxxxxx> wrote: > > *Problem* > ovl_permission() checks the underlying inode with the credential of mounter. > The cred, struct ovl_fs::creator_cred, is somewhat global per overlayfs > superblock. Performance degrades when concurrency increases on the cred, to be > specific, on ovl_fs::creator_cred::usage. > > This happens when there are massive file access inside container, especially > on SoC with many cores. With Linux 6.6.0-rc2, we run a web workload container > on Intel 4th Xeon Sapphire Rapids which has 56 cores. Perf reports that 5.7% > (2.50% + 1.87% + 1.33%) CPU stall in overlayfs: > Self Command Shared Object Symbol > 2.50% foo [kernel.vmlinux] [k] override_creds > 1.87% foo [kernel.vmlinux] [k] revert_creds > 1.33% foo [kernel.vmlinux] [k] generic_permission > > On Soc with more than 100 cores, we can even observe ~30% CPU stalled! > > This scalability issue is caused by two factors: > 1) Contention on creator_cred::usage > creator_cred::usage is atomic_t and is inc/dec atomically during every file > access. So HW acquires the corresponding cache line exclusively. This > operataiton is expensive and gets worse when contention is heavy. > Call chain: > ovl_permission() > -> ovl_override_creds() > -> override_creds() > -> get_new_cred() > -> atomic_inc(&cred->usage); > > ovl_permission() > -> revert_creds() > -> put_cred() > -> atomic_dec_and_test(&(cred)->usage)) > > 2) False sharing > `perf c2c` shows false sharing issue between cred::usage and cred::fsuid. > This is why generic_permission() stalls 1.33% CPU in above perf report. > ovl_permission() updates cred::usage and it also reads cred::fsuid. > Unfortunately, they locate in the same cache line and thus false sharing > occurs. cred::fsuid is read at: > ovl_permission() > -> inode_permission() > -> generic_permission() > -> acl_permission_check() > -> current_fsuid() > > *Mitigations we tried* > We tried several mitigations but are not sure if it can be a fix or just > workaround / hack. So we report this and want to have some discussions. > > Our mitigations aims to eliminate the contention on creator_cred->usage. > Without contention, the false sharing will be tiny and no need to handle. The > mitigations we tested are: > 1) Check underlying inode once in its lifetime. But the check is against a specific permission mask. Your patch caches the result of the permission check of a specific mask and uses it as the result to return for any mask. > OR > 2) In ovl_permission(), copy global creator_cred to a local variable to > avoid concurrency. > This sounds a bit risky, but maybe it can work. If you want to create a local copy of creds, I think that the fact that this is a local copy should be expressed in flags like cred->non_rcu. put_cred() should be aware of this flag and avoid calling __put_cred(). The local copy should be initialized with usage 1 by copying creator_cred and we need to have an assertion if cred->usage drops to 0. Also, ofs->creator_cred itself should be marked as a "read-only copy" of credentials and we should add assertions to make sure that no code calls get_cred() on a read-only copy. The ovl_override_cred() function should take a local cred variable to use the copy method for any access to ofs->creator_cred, not only in ovl_permissions(). ovl_override_creds() should be coupled with ovl_revert_creds() which also takes the local var as argument and also asserts that the local copy usage is 1. We can maybe take the opportunity to DEFINE_GUARD() for an ovl_cred struct and use it in many of the overlayfs methods. And maybe I am missing something and this cannot be done or there is a much easier way to solve the problem. Thanks, Amir.