On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote: > From: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx> > > SGX Enclave Page Cache (EPC) memory allocations are separate from normal > RAM allocations, and are managed solely by the SGX subsystem. The > existing cgroup memory controller cannot be used to limit or account for > SGX EPC memory, which is a desirable feature in some environments. For > instance, within a Kubernetes environment, while a user may specify a > particular EPC quota for a pod, the orchestrator requires a mechanism to > enforce that the pod's actual runtime EPC usage does not exceed the > allocated quota. > > Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to > limit and track EPC allocations per cgroup. Earlier patches have added > the "sgx_epc" resource type in the misc cgroup subsystem. Add basic > support in SGX driver as the "sgx_epc" resource provider: > > - Set "capacity" of EPC by calling misc_cg_set_capacity() > - Update EPC usage counter, "current", by calling charge and uncharge > APIs for EPC allocation and deallocation, respectively. > - Setup sgx_epc resource type specific callbacks, which perform > initialization and cleanup during cgroup allocation and deallocation, > respectively. > > With these changes, the misc cgroup controller enables users to set a hard > limit for EPC usage in the "misc.max" interface file. It reports current > usage in "misc.current", the total EPC memory available in > "misc.capacity", and the number of times EPC usage reached the max limit > in "misc.events". > > For now, the EPC cgroup simply blocks additional EPC allocation in > sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are > still tracked in the global active list, only reclaimed by the global > reclaimer when the total free page count is lower than a threshold. > > Later patches will reorganize the tracking and reclamation code in the > global reclaimer and implement per-cgroup tracking and reclaiming. > > Co-developed-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> > Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> > Signed-off-by: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx> > Co-developed-by: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx> > Signed-off-by: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx> > Reviewed-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx> > Reviewed-by: Tejun Heo <tj@xxxxxxxxxx> > Tested-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx> I don't see any big issue, so feel free to add: Reviewed-by: Kai Huang <kai.huang@xxxxxxxxx> Nitpickings below: [...] > --- /dev/null > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > @@ -0,0 +1,72 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2022-2024 Intel Corporation. */ > + > +#include <linux/atomic.h> > +#include <linux/kernel.h> It doesn't seem you need the above two here. Probably they are needed in later patches, in that case we can move to the relevant patch(es) that they got used. However I think it's better to explicitly include <linux/slab.h> since kzalloc()/kfree() are used. Btw, I am not sure whether you want to use <linux/kernel.h> because looks it contains a lot of unrelated staff. Anyway I guess nobody cares. > +#include "epc_cgroup.h" > + > +/* The root SGX EPC cgroup */ > +static struct sgx_cgroup sgx_cg_root; The comment isn't necessary (sorry didn't notice before), because the code is pretty clear saying that IMHO. [...] > > --- /dev/null > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h > @@ -0,0 +1,72 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _SGX_EPC_CGROUP_H_ > +#define _SGX_EPC_CGROUP_H_ > + > +#include <asm/sgx.h> I don't see why you need <asm/sgx.h> here. Also, ... > +#include <linux/cgroup.h> > +#include <linux/misc_cgroup.h> > + > +#include "sgx.h" ... "sgx.h" already includes <asm/sgx.h> [...] > > +static inline struct sgx_cgroup *sgx_get_current_cg(void) > +{ > + /* get_current_misc_cg() never returns NULL when Kconfig enabled */ > + return sgx_cgroup_from_misc_cg(get_current_misc_cg()); > +} I spent some time looking into this. And yes if I was reading code correctly the get_current_misc_cg() should never return NULL when Kconfig is on. I typed my analysis below in [*]. And it would be helpful if any cgroup expert can have a second eye on this. [...] > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -6,6 +6,7 @@ > #include <linux/highmem.h> > #include <linux/kthread.h> > #include <linux/miscdevice.h> > +#include <linux/misc_cgroup.h> Is this needed? I believe SGX variants in "epc_cgroup.h" should be enough for sgx/main.c? [...] [*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on (code indent slight adjusted for text wrap). Firstly, during kernel boot there's always a valid @css allocated for MISC cgroup, regardless whether it is disabled in kernel command line. int __init cgroup_init(void) { ... for_each_subsys(ss, ssid) { if (ss->early_init) { ... } else { cgroup_init_subsys(ss, false); } ... if (!cgroup_ssid_enabled(ssid)) continue; ... } ... } cgroup_init_subsys() makes a valid @css is allocated for MISC cgroup and set the pointer to the @init_css_set. static void __init cgroup_init_subsys(struct cgroup_subsys *ss, ...) { struct cgroup_subsys_state *css; ... css = ss->css_alloc(NULL); /* We don't handle early failures gracefully */ BUG_ON(IS_ERR(css)); ... init_css_set.subsys[ss->id] = css; ... } All processes are by default associated to the @init_css_set: void cgroup_fork(struct task_struct *child) { RCU_INIT_POINTER(child->cgroups, &init_css_set); INIT_LIST_HEAD(&child->cg_list); } At runtime, when a new cgroup is created in the hierarchy, the "cgroup" can have a NULL @css if some subsystem is not enabled in it: static int cgroup_apply_control_enable(struct cgroup *cgrp) { struct cgroup *dsct; struct cgroup_subsys_state *d_css; struct cgroup_subsys *ss; int ssid, ret; cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) { for_each_subsys(ss, ssid) { struct cgroup_subsys_state *css = cgroup_css(dsct, ss); if (!(cgroup_ss_mask(dsct) & (1 << ss->id))) continue; if (!css) { css = css_create(dsct, ss); if (IS_ERR(css)) return PTR_ERR(css); } ... } } } We can see if cgroup_ss_mask(dsct) doesn't have subsystem enabled, the css_create() won't be invoked, and cgroup->subsys[ssid] will remain NULL. However, when a process is bound to a specific cgroup, the kernel tries to get the cgorup's "effective css", and it seems this "effective css" cannot be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which the MISC cgroup does. There are couple of code paths can lead to this, but they all reach to static struct css_set *find_existing_css_set(...) { struct cgroup_root *root = cgrp->root; struct cgroup_subsys *ss; struct css_set *cset; unsigned long key; int i; for_each_subsys(ss, i) { if (root->subsys_mask & (1UL << i)) { /* * @ss is in this hierarchy, so we want * the effective css from @cgrp. */ template[i] = cgroup_e_css_by_mask(cgrp, ss); } else { /* * @ss is not in this hierarchy, so we * don't want to change the css. */ template[i] = old_cset->subsys[i]; } } ... } Which calls cgroup_e_css_by_mask() to get the "effective css" when subsys is enabled in the root cgroup (which means MISC cgroup is not disabled by kernel command line), or get the default css, which is @init_css_set- >subsys[ssid], which is always valid for MISC cgroup. And more specifically, the "effective css" in the cgroup_e_css_by_mask() is done by searching the entire hierarchy, so MISC cgroup will always have a valid "effective css". static struct cgroup_subsys_state *cgroup_e_css_by_mask( struct cgroup *cgrp, struct cgroup_subsys *ss) { lockdep_assert_held(&cgroup_mutex); if (!ss) return &cgrp->self; ... while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) { cgrp = cgroup_parent(cgrp); if (!cgrp) return NULL; } return cgroup_css(cgrp, ss); } The comment of cgroup_e_css_by_mask() says: * Similar to cgroup_css() but returns the effective css, which is defined * as the matching css of the nearest ancestor including self which has @ss * enabled. If @ss is associated with the hierarchy @cgrp is on, this * function is guaranteed to return non-NULL css. It's hard for me to interpret the second sentence, specifically, what does "@ss is associated with the hierarchy @cgrp is on" mean. I interpret it as "subsys is enabled in root and/or any descendants". But again, in the find_existing_css_set() it is called when the root cgroup has enabled the subsys, so it should always return a non-NULL css. And that means for any process, get_current_misc_cg() cannot be NULL.