Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 16 Apr 2024 13:22:06 +0000

On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx>
> 
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
> 
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
> 
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
> 
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx>
> Co-developed-by: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx>
> Signed-off-by: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx>
> Reviewed-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>
> Reviewed-by: Tejun Heo <tj@xxxxxxxxxx>
> Tested-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>

I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang <kai.huang@xxxxxxxxx>

Nitpickings below:

[...]

> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>

It doesn't seem you need the above two here.

Probably they are needed in later patches, in that case we can move to the
relevant patch(es) that they got used.

However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use <linux/kernel.h> because looks
it contains a lot of unrelated staff.  Anyway I guess nobody cares.

> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;

The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.

[...]

> 
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>

I don't see why you need <asm/sgx.h> here.  Also, ...

> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"

... "sgx.h" already includes <asm/sgx.h>

[...]

> 
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> +	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
> +	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

I spent some time looking into this.  And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]

> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>

Is this needed?  I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?

[...]

[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slight adjusted for text wrap).

Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless whether it is disabled in kernel command line.

	int __init cgroup_init(void)
	{
		...

 	        for_each_subsys(ss, ssid) {                                    
			if (ss->early_init) {                                                 
				...
                	} else {                                                
                		cgroup_init_subsys(ss, false);                         
                	}

			...

                	if (!cgroup_ssid_enabled(ssid))
                        	continue;
			...
		}
		...
	}

cgroup_init_subsys() makes a valid @css is allocated for MISC cgroup and
set the pointer to the @init_css_set.

	static void __init cgroup_init_subsys(struct cgroup_subsys *ss, 
						...)
	{
        	struct cgroup_subsys_state *css;

		...
        	css = ss->css_alloc(NULL);
        	/* We don't handle early failures gracefully */
        	BUG_ON(IS_ERR(css));

		...

        	init_css_set.subsys[ss->id] = css;

		...
	}

All processes are by default associated to the @init_css_set:

	void cgroup_fork(struct task_struct *child)
	{
        	RCU_INIT_POINTER(child->cgroups, &init_css_set);
        	INIT_LIST_HEAD(&child->cg_list);
	}

At runtime, when a new cgroup is created in the hierarchy, the "cgroup"
can have a NULL @css if some subsystem is not enabled in it:

	static int cgroup_apply_control_enable(struct cgroup *cgrp)             
	{       
        	struct cgroup *dsct;                                            
        	struct cgroup_subsys_state *d_css;                              
        	struct cgroup_subsys *ss;
        	int ssid, ret;

        	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {        
                	for_each_subsys(ss, ssid) {
                        	struct cgroup_subsys_state *css = 
						cgroup_css(dsct, ss);                                              

                        	if (!(cgroup_ss_mask(dsct) & 
					(1 << ss->id)))                                                     
                                	continue;                               

	                        if (!css) {                                     
	                                css = css_create(dsct, ss);             
	                                if (IS_ERR(css))                        
	                                        return PTR_ERR(css);            
	                        }
				...
			}
		}
	}

We can see if cgroup_ss_mask(dsct) doesn't have subsystem enabled, the
css_create() won't be invoked, and cgroup->subsys[ssid] will remain NULL.

However, when a process is bound to a specific cgroup, the kernel tries to
get the cgorup's "effective css", and it seems this "effective css" cannot
be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which
the MISC cgroup does.

There are couple of code paths can lead to this, but they all reach to

	static struct css_set *find_existing_css_set(...)
	{
	        struct cgroup_root *root = cgrp->root;
	        struct cgroup_subsys *ss;
	        struct css_set *cset;
	        unsigned long key;
	        int i;  

	        for_each_subsys(ss, i) {
	                if (root->subsys_mask & (1UL << i)) {
	                        /*
	                         * @ss is in this hierarchy, so we want
	                         * the effective css from @cgrp.
	                         */
	                        template[i] = cgroup_e_css_by_mask(cgrp,
						ss); 
	                } else {        
	                        /*
	                         * @ss is not in this hierarchy, so we
	                         * don't want to change the css.
	                         */
	                        template[i] = old_cset->subsys[i];
	                }
	        }
		...
	}

Which calls cgroup_e_css_by_mask() to get the "effective css" when subsys
is enabled in the root cgroup (which means MISC cgroup is not disabled by
kernel command line), or get the default css, which is @init_css_set-
>subsys[ssid], which is always valid for MISC cgroup.

And more specifically, the "effective css" in the cgroup_e_css_by_mask()
is done by searching the entire hierarchy, so MISC cgroup will always have
a valid "effective css".

	static struct cgroup_subsys_state *cgroup_e_css_by_mask(
				struct cgroup *cgrp,
				struct cgroup_subsys *ss)
	{               
        	lockdep_assert_held(&cgroup_mutex);

        	if (!ss)
        	        return &cgrp->self; 

		...
        	while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
        	        cgrp = cgroup_parent(cgrp);
       	        	if (!cgrp)
                        	return NULL;
        	}

		return cgroup_css(cgrp, ss);
	}

The comment of cgroup_e_css_by_mask() says:

* Similar to cgroup_css() but returns the effective css, which is defined
* as the matching css of the nearest ancestor including self which has @ss
* enabled.  If @ss is associated with the hierarchy @cgrp is on, this
* function is guaranteed to return non-NULL css.

It's hard for me to interpret the second sentence, specifically, what does
"@ss is associated with the hierarchy @cgrp is on" mean.  I interpret it
as "subsys is enabled in root and/or any descendants".

But again, in the find_existing_css_set() it is called when the root
cgroup has enabled the subsys, so it should always return a non-NULL css.

And that means for any process, get_current_misc_cg() cannot be NULL.