Introduction
------------

This patch series adds a new "managed mode" to percpu-refcounts for
managing references for objects that are released after an RCU grace
period has passed since their last reference drop. The typical usage
pattern looks like this:

    // Called with elevated refcount
    get()
        p = get_ptr();
        kref_get(&p->count);
        return p;

    get()
        rcu_read_lock();
        p = get_ptr();
        if (p && !kref_get_unless_zero(&p->count))
            p = NULL;
        rcu_read_unlock();
        return p;

    release()
        remove_ptr(p);
        call_rcu(&p->rcu, freep);

    release()
        remove_ptr(p);
        kfree_rcu(p, rcu);

Requirement and Use Case
------------------------

Percpu refcount requires an explicit percpu_ref_kill() operation at the
object's usage site where the initial ref count is being dropped. For
optimal performance, the object's usage should reach a teardown point,
after which references shouldn't be acquired or released frequently
before the final reference is dropped. Following the percpu_ref_kill(),
all refcount operations on the object are carried out on the
centralized atomic counter. If references are still being added or
removed after the percpu_ref_kill() operation, performance and
scalability degrade because the atomic counter's cache line ping-pongs
between CPUs.

The throughput scalability issue seen when Nginx runs with the AppArmor
Linux security module enabled is the primary motivation for this
change. Performance profiling shows that memory contention in the
atomic_fetch_add and atomic_fetch_sub operations carried out in
kref_get() and kref_put() on AppArmor labels accounts for the majority
of CPU cycles. Further information on the Nginx throughput scalability
impact and the improvements obtained with percpu references can be
found in [1]. However, because of the way references are used in
AppArmor, switching from kref usage to percpu refcount was found to be
non-trivial.
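As a rough illustration of the percpu_ref_kill() behaviour described
above, here is a single-threaded userspace model. It is only a sketch:
every sim_* name, SIM_NR_CPUS and the shared_ops counter are
hypothetical and not part of the kernel API, the "CPU" is passed in as
a plain argument, and the real percpu-refcount uses per-CPU variables,
an atomic_long_t and RCU for the mode switch. The model only shows that
gets and puts stay CPU-local until the kill, and that every operation
after the kill touches the one shared counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SIM_NR_CPUS 4 /* hypothetical CPU count for this model */

/* Hypothetical userspace model of a percpu ref; not the kernel struct. */
struct sim_ref {
	long percpu[SIM_NR_CPUS];  /* per-CPU deltas, used before the kill */
	long atomic_count;         /* central counter, holds the init ref  */
	bool dead;                 /* set once sim_ref_kill() has run      */
	bool released;             /* set when the count reaches zero      */
	unsigned long shared_ops;  /* ops that touched the shared counter  */
};

static void sim_ref_init(struct sim_ref *r)
{
	*r = (struct sim_ref){ .atomic_count = 1 }; /* initial reference */
}

static void sim_ref_get(struct sim_ref *r, int cpu)
{
	if (!r->dead) {
		r->percpu[cpu]++;   /* CPU-local, no shared write */
		return;
	}
	r->atomic_count++;          /* shared counter after the kill */
	r->shared_ops++;
}

static void sim_ref_put(struct sim_ref *r, int cpu)
{
	if (!r->dead) {
		r->percpu[cpu]--;   /* zero is not detected in this mode */
		return;
	}
	r->shared_ops++;
	if (--r->atomic_count == 0)
		r->released = true;
}

/* Fold the per-CPU deltas into the central counter, drop the init ref. */
static void sim_ref_kill(struct sim_ref *r)
{
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++) {
		r->atomic_count += r->percpu[cpu];
		r->percpu[cpu] = 0;
	}
	r->dead = true;
	if (--r->atomic_count == 0)
		r->released = true;
}

/* Returns the number of shared-counter operations seen after the kill. */
static unsigned long sim_demo(void)
{
	struct sim_ref r;

	sim_ref_init(&r);
	sim_ref_get(&r, 0);   /* taken on CPU 0 */
	sim_ref_get(&r, 1);   /* taken on CPU 1 */
	sim_ref_put(&r, 2);   /* dropped on CPU 2, its delta goes negative */
	sim_ref_kill(&r);     /* teardown point */
	sim_ref_get(&r, 3);   /* late get: hits the shared counter */
	sim_ref_put(&r, 3);
	sim_ref_put(&r, 0);   /* last put releases the object */
	printf("released=%d shared_ops=%lu\n", r.released, r.shared_ops);
	return r.released ? r.shared_ops : 0;
}
```

Calling sim_demo() from a test harness returns 3: the three operations
issued after sim_ref_kill() all go through the shared counter, while
the three issued before it only touch per-CPU slots. An object with no
clear teardown point would keep paying that shared-counter cost, which
is the situation the AppArmor labels are in.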
Although the specifics of AppArmor refcount management have already
been covered in [1], the explanation that follows updates that
description with more detailed (and hopefully more accurate)
information that supports the requirement for a managed percpu ref.

Within the AppArmor framework, a label (struct aa_label) manages
references for different kinds of objects. Labels are associated with:

- Profiles for applications.
- Namespaces, via their unconfined profile.
- Audit, secmark rules and compound labels.

Labels are referenced by file contexts, security contexts, secids and
sockets.

The diagram below illustrates the relationship between different
AppArmor objects via their label references.

                  ----------------
                 | Root Namespace |
                  ----------------
                   /   ^     |   ^
                (a)    |    (c)  |
                /     (b)    |  (d)
               v     /       v   |
           ------------   -----------------
          | Profile 1  | | Child Namespace |
           ------------   -----------------
              |    ^           |     ^
             (e)   |          (g)    |
              |   (f)          |    (h)
              v    |           v     |
         ---------------     -----------
        | Child Profile |   | Profile 2 |
         ---------------     -----------
                 ^               ^
                  \             /
                   \           /
                    \         /
                     (i) |
                 ----------------
                | Compound Label |
                 ----------------

(a) The root namespace keeps track of every profile that exists in it.
    When a profile is loaded and unpacked, a reference to the profile
    is taken for this purpose. This reference to the profile object is
    also used as its **init reference**.

(b) The root namespace is referenced by each profile that is part of
    it.

(c) To control confinement within a certain domain, such as a chroot
    environment, a root namespace may include child namespaces. Through
    each child namespace's unconfined label, the subnamespaces list in
    the root namespace maintains an (init) reference to its child
    namespaces.

(d) A child namespace maintains a reference to its parent namespace.

(e) A profile can have child subprofiles, which are called hat
    profiles. Certain program segments can be run with permissions
    differing from the base permissions using these profiles.
    For instance, executing user-supplied CGI programs in a different
    Apache profile, or running authenticated and unauthenticated
    traffic in separate OpenSSH profiles. Through its policy profiles
    list, the parent profile maintains a reference to its child
    subprofiles. This serves as the child profile's init reference.

(f) Child profiles keep a reference to their parent profile.

(g) A child namespace keeps a reference to every profile in it.

(h) Child profiles maintain a reference to their parent non-root
    namespace.

(i) Context-specific application confinement is implemented using
    compound/stack labels. When ls is started from bash, for instance,
    the confinement rules for the profile /bin/bash///bin/ls may differ
    from the system-level rules for ls execution. Compound labels are
    vectors of profiles and maintain a reference to every profile in
    their vector.

Label references
----------------

- Tasks are linked to labels via the security field of their cred.
  During cred preparation for a bprm exec, the cred label is copied
  from the parent task, and the bprm is transitioned to the new label
  using the parent task's profile transition rules. Depending on the
  perms rule for the bprm's path, the transition may use a
  compound/stack label or the label of a single profile.

  The label linked to the task's cred is used when performing policy
  checks in AppArmor's security hooks for operations like file
  permissions, mkdir, rmdir, mount, and so on.

  When the associated label is marked as stale, the cred label of a
  task can change (from within its context) while the task is
  executing.

  A task maintains references to previous labels for hat transitions,
  onexec labels, and nnp (no new privilege) labels for exec domain
  transition checks.

  Labels are cached in the file context for file permission checks on
  open files. As a result of task profile updates, this label is
  updated with new profiles from the task's current label during
  revalidation of cached file permissions.
- Socket contexts store the labels of the current task and the peer.

- Profile fs maintains references to the label proxy and namespace in
  the inode->i_private fields.

- Secmark rule objects reference the label parsed from the rule
  string.

- Audit rule objects reference the label parsed from the rule string.

Label's Initial Ref Teardown
----------------------------

- When a profile is deleted, the initial reference on its label is
  dropped and it is no longer a part of the parent namespace or parent
  profile. Furthermore, every one of its child profiles is deleted
  recursively. As a result, all profiles that are reachable from the
  base profile have their initial reference dropped in a cascaded
  manner.

- When a namespace is destroyed, the initial reference to its
  unconfined label is dropped and it is removed from the parent
  namespace view. Furthermore, all profiles in that namespace, all
  sub-namespaces, and all profiles inside those sub-namespaces are
  recursively removed and their initial label reference is dropped.

- References to parent objects are dropped when a label is released
  after its last reference drop. A profile's parent profile and
  namespace references are dropped upon ref release. On the namespace
  ref release path, a namespace drops its reference to its parent
  namespace. As part of the label release, references to the profiles
  in a compound label's vector are dropped.

Stale Labels and Label Redirection
----------------------------------

- The label associated with a profile/namespace that is deleted is
  marked as stale. When any profile of a compound label is stale, the
  compound label is also marked stale.

- A label's proxy is used to redirect stale labels to the most recent
  or active version of the object. For example, when a profile is
  deleted, its proxy is redirected to the unconfined label of the
  namespace. This means that every application that the profile
  confined has been moved to an unconfined profile.
  In the same manner, the proxy is redirected to the new profile's
  label when a profile is replaced. On namespace deletion, the proxy of
  a namespace's unconfined label is redirected to the unconfined label
  of its parent namespace.

  Redirection to the new label is done during the reference get
  operation:

    struct aa_label *aa_get_newest_label(struct aa_label *l)
    {
        /* Proxy slot that points at the current label. */
        struct aa_label __rcu **lp = &l->proxy->label;
        struct aa_label *c;

        rcu_read_lock();
        do {
            /* Retry if the fetched label's count already hit zero. */
            c = rcu_dereference(*lp);
        } while (c && !kref_get_unless_zero(&c->count));
        rcu_read_unlock();

        return c;
    }

Label reclaims
--------------

A label is completely initialized when it is linked to a namespace.
Label destruction is deferred until the end of an RCU grace period
which starts after the last reference drop. The RCU callback for the
destruction of the label and its associated objects is enqueued from
the ref release callback.

    void aa_label_kref(struct kref *kref)
    {
        struct aa_label *label = container_of(kref, struct aa_label, count);
        struct aa_ns *ns = labels_ns(label);

        if (!ns) {
            /* Not linked to a namespace: free without a grace period. */
            label_free_switch(label);
            return;
        }
        call_rcu(&label->rcu, label_free_rcu);
    }

Using Label Stale operation for percpu_ref_kill()?
--------------------------------------------------

Marking a label as stale can serve as a reference termination point,
since stale labels are redirected to the current label linked to their
objects. There are other labels, though, that are not associated with
namespaces or profiles. These are compound labels linked to audit and
secmark rules, or to running tasks that hold those label references in
their cred structure. These labels are:

- The label that is created from the rule string is referenced by
  audit rules. It is possible that a multi-element vector audit rule
  label already exists in the root labelset or that a new label is
  created during audit rule init. The reference is dropped upon audit
  rule free.
  It's possible that the created label is actively referenced from
  other contexts, causing atomic contention on the label's ref
  operations if percpu_ref_kill() is called on audit rule free.

- The stacked labels which are created on profile exec/domain
  transitions are stored in the task's cred structure. These labels
  are released when all tasks drop their cred reference to those
  labels.

- Transition labels which are created during change hat or change
  profile transitions could be referenced by multiple tasks. These
  labels are released when all tasks drop their cred reference to
  those labels.

- A task's most recent label is combined with and cached in open file
  contexts. These cached labels don't have a defined termination point
  and can be actively referenced from multiple contexts.

- Other compound labels with similar ref lifetimes include pivotroot
  and secmark rules.

There are further scenarios in which stale labels may still be
referenced:

- Stale flags on labels are set using plain writes, and until a CPU
  observes the stale flag, it may still take or drop references on the
  stale label.

- A task may reference a namespace which is marked stale.

- For a stale cred label whose proxy points to its namespace's stale
  unconfined label, the stale unconfined label can be referenced until
  the cred label is updated.

In summary, though percpu_ref_kill() can be used for labels when they
are marked stale, compound labels are not guaranteed to be marked stale
during their lifetime, and they do not have a context where
percpu_ref_kill() can be done.

Proposed Solution
-----------------

The solution proposed here attempts to address the issue of
identifying the init reference drop context. A percpu ref manager
thread keeps an extra reference to the ref. This additional reference
is used as a (pseudo) init reference to the object. A percpu managed
ref instance offloads its ref's release work to the ref manager thread.
The ref manager thread uses the following sequence to periodically scan
the list of managed refs and determine whether a ref is active:

    scan_ref() {
        bool active;

        percpu_ref_switch_to_atomic_sync(&ref);
        rcu_read_lock();
        percpu_ref_put(&ref);
        active = percpu_ref_tryget(&ref);
        rcu_read_unlock();
        if (active)
            percpu_ref_switch_to_percpu(&ref);
    }

The sequence above switches the ref to atomic mode, drops the
pseudo-init reference, and checks (within RCU read-side protection)
whether all references have been dropped. If there are any active
references, the ref is switched back to percpu mode, with the
pseudo-init reference re-acquired through the tryget operation.

The two approaches used in this patch series, with slightly differing
permitted ref mode switches and semantics, are listed below.

Approach 1
----------

Approach 1 is implemented in patch 1 and has the following semantics
for ref init and switch.

a. Init

A ref can be set to managed mode at initialization time in
percpu_ref_init(), by passing the PERCPU_REF_REL_MANAGED flag, or by
calling percpu_ref_switch_to_managed() post init to switch a reinitable
ref to managed mode. Deferred switches are used in situations like a
module initialization error, when an initialized ref is released before
the object is used. One example of this is the release of AppArmor
labels which are not associated with a namespace, which is done without
waiting for an RCU grace period.

Below are the allowed initialization modes for a managed ref:

                  Atomic  Percpu  Dead  Reinit  Managed
    Managed-ref     Y       N      Y      Y        Y

b. Switching modes and operations

Below are the allowed transitions for a managed ref.
  To -->    A   P   P(RI)   M   D   D(RI)   D(RI/M)   KLL   REI   RES

  A         y   n   y       y   n   y       y         y     y     y
  P         n   n   n       n   y   n       n         y     n     n
  M         n   n   n       y   n   n       y         n     y     y
  P(RI)     y   n   y       y   n   y       y         y     y     y
  D(RI)     y   n   y       y   n   y       y         -     y     y
  D(RI/M)   n   n   n       y   n   n       y         -     y     y

Modes:
  A       - Atomic
  P       - PerCPU
  M       - Managed
  P(RI)   - PerCPU with ReInit
  D(RI)   - Dead with ReInit
  D(RI/M) - Dead with ReInit and Managed

PerCPU Ref Ops:
  KLL - Kill
  REI - Reinit
  RES - Resurrect

A percpu reference that has been switched to managed mode cannot be
switched back to any other active mode. A managed ref is reinitialized
to managed mode upon reinit/resurrect.

Approach 2
----------

The second approach gives a managed reference greater runtime mode
switching flexibility. This may be helpful in situations where the
object of a managed reference can enter a shutdown phase in some
scenarios. For example, for stale singular/compound labels, the user
can directly call percpu_ref_kill() on the ref rather than waiting for
the manager thread to process it.

The init modes are the same as in the previous approach. Runtime mode
switching provides the ability to convert from managed mode to
unmanaged mode, hence enabling transitions to all reinitable modes.

  To -->    A   P   P(RI)   M   D   D(RI)   D(RI/M)   KLL   REI   RES

  A         y   n   y       y   n   y       y         y     y     y
  P         n   n   n       n   y   n       n         y     n     n
  M         y*  n   y*      y   n   y*      y         y*    y     y
  P(RI)     y   n   y       y   n   y       y         y     y     y
  D(RI)     y   n   y       y   n   y       y         -     y     y
  D(RI/M)   y*  n   y*      y   n   y*      y         -     y     y

(RI) refers to modes whose initialization was done using
PERCPU_REF_ALLOW_REINIT.

The transitions listed above are permitted and may be indirect
transitions. For example, a managed ref switches to P(RI) mode when
percpu_ref_switch_to_unmanaged() is invoked on it.
percpu_ref_switch_to_atomic() can then be used to switch from P(RI)
mode to A mode.

Design Implications
-------------------

1. Deferring the release of a referenced object to the manager thread
   may delay its memory release. This can result in memory pressure.
   With the second approach, this issue can be partially resolved by
   turning a managed ref into an unmanaged ref and then executing
   percpu_ref_kill() on it at known shutdown points in the execution.
   Flushing the scanning work on memory pressure is another strategy
   that can be used.

2. call_rcu_hurry() is used by the percpu refcount lib to perform mode
   switch operations. Back-to-back hurry callbacks can impact energy
   efficiency. The current implementation allows moving the execution
   to housekeeping cores by using an unbound workqueue. A deferrable
   timer can be used to prevent these invocations when the core is
   idle by delaying the worker execution. Deferring, though, may cause
   ref reclaims to be delayed.

3. Since the percpu refcount lib uses a single global switch spinlock,
   back-to-back label switches can delay other percpu users.

4. Long running kworkers may cause other use cases, such as system
   suspend, to be delayed. This is mitigated by using a freezable work
   queue and limiting node scans to a maximum count.

5. Because all managed refs undergo the switch-to-atomic mode
   operation serially, an inactive ref must wait for all prior grace
   periods to complete before it can be assessed. Ref release may be
   greatly delayed as a result. Batching ref switches can be one way
   to deal with this, so that all of those RCU callbacks complete
   within a single grace period.

6. A label's refcount operates in atomic mode within the window in
   which its counter is being checked for zero. This could lead to
   high memory contention for the duration of the RCU grace period
   (together with callback execution). In AppArmor, if the currently
   scanned label belongs to an unconfined profile, all applications
   that use unconfined profiles will execute atomic ref increment and
   decrement operations on the ref during that window. To handle this,
   a prototype is described and implemented in [1], which replaces the
   atomic and percpu counters of the scanned ref with a temporary
   percpu ref.
   Given that the grace period window is short (compared to the scan
   interval), the overall impact of this might not be significant
   enough to justify the massive complexity of that prototype
   implementation. This problem requires more investigation in order
   to find a simpler solution.

Extended/Future Work
--------------------

1. Another design approach that was considered is to define a new
   percpu rcuref type for RCU managed percpu refcounts. This approach
   is prototyped in [1]. Although this approach provides cleaner
   semantics w.r.t. mode switches and allowed operations, its current
   implementation, which uses composition of percpu ref, could be
   suboptimal in terms of the struct's cacheline space requirement and
   feature extensibility. An independent implementation would require
   refactoring the common logic out of the percpu refcount
   implementation. Additionally, the users of the new API could
   require the modes (e.g. ref kill/reinit) supported by percpu
   refcount. Extending percpu rcuref to support this can result in
   duplication of functionality/semantics between the two percpu ref
   types.

2. Explore hazard pointers for scalable refcounting of objects, which
   provide a more generic solution and have more efficient memory
   space requirements.

Below is the organization of the patches in this series:

1. Implementation of the first approach described in the "Proposed
   Solution" section.

2. Torture test for managed ref to validate early ref release and
   imbalanced refcount.

   The test is verified on an AMD 4th Generation EPYC Processor with
   96C/192T with the following test parameters:

     nusers = 300
     nrefs = 50
     niterations = 50000
     onoff_holdoff = 5
     onoff_interval = 10

3. Implementation of the second approach described in the "Proposed
   Solution" section.

4. Updates to the torture test to test runtime mode switches from
   managed to unmanaged modes.

5. Switch Label refcount management to percpu ref in atomic mode.

6. Switch Label refcount management to managed mode.
Highly appreciate any feedback/suggestions on the design approach.

[1] https://lore.kernel.org/lkml/20240110111856.87370-7-Neeraj.Upadhyay@xxxxxxx/T/

- Neeraj

Neeraj Upadhyay (6):
  percpu-refcount: Add managed mode for RCU released objects
  percpu-refcount: Add torture test for percpu refcount
  percpu-refcount: Extend managed mode to allow runtime switching
  percpu-refcount-torture: Extend test with runtime mode switches
  apparmor: Switch labels to percpu refcount in atomic mode
  apparmor: Switch labels to percpu ref managed mode

 .../admin-guide/kernel-parameters.txt         |  69 +++
 include/linux/percpu-refcount.h               |  14 +
 lib/Kconfig.debug                             |   9 +
 lib/Makefile                                  |   1 +
 lib/percpu-refcount-torture.c                 | 404 ++++++++++++++++++
 lib/percpu-refcount.c                         | 329 +++++++++++++-
 lib/percpu-refcount.h                         |   6 +
 security/apparmor/include/label.h             |  16 +-
 security/apparmor/include/policy.h            |   8 +-
 security/apparmor/label.c                     |  12 +-
 security/apparmor/policy_ns.c                 |   2 +
 11 files changed, 836 insertions(+), 34 deletions(-)
 create mode 100644 lib/percpu-refcount-torture.c
 create mode 100644 lib/percpu-refcount.h

--
2.34.1