Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 20 Feb 2024 09:52:39 +0000

> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors.  Used in
> + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is
> + * over its limit or its ancestors' hence reclamation is needed.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
> +{
> +	struct misc_cg *i = epc_cg->cg;
> +	u64 m = U64_MAX;
> +
> +	while (i) {
> +		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> +		i = misc_cg_parent(i);
> +	}
> +
> +	return m / PAGE_SIZE;
> +}

I am not sure, but is it possible or legal for an ancestor to have less limit
than children?

> +
>  /**
> - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
> + * @root:	Root of the tree to check
>   *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> +{
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	bool ret = true;
> +
> +	/*
> +	 * Caller ensure css_root ref acquired
> +	 */
> +	css_root = &root->css;
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> +		spin_lock(&epc_cg->lru.lock);
> +		ret = list_empty(&epc_cg->lru.reclaimable);
> +		spin_unlock(&epc_cg->lru.lock);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!ret)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
> + * @root:	Root of the tree to start walking from.
> + * Return:	Number of pages reclaimed.

Just wondering, do you need to return @cnt given this function is called w/o
checking the return value?

> + */
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> +	/*
> +	 * Attempting to reclaim only a few pages will often fail and is
> +	 * inefficient, while reclaiming a huge number of pages can result in
> +	 * soft lockups due to holding various locks for an extended duration.
> +	 */

Not sure we need this comment, given it's already implied in
sgx_reclaim_pages().  You cannot pass a value > SGX_NR_TO_SCAN anyway.

> +	unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	unsigned int cnt;
> +
> +	 /* Caller ensure css_root ref acquired */
> +	css_root = &root->css;
> +
> +	cnt = 0;
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!nr_to_scan)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +	return cnt;
> +}

Here the @nr_to_scan is reduced by the number of pages that are isolated, but
not actually reclaimed (which is reflected by @cnt).

IIUC, looks you want to make this function do "each cycle" as what you mentioned
in the v8 [1]:

	I tested with that approach and found we can only target number of
pages  
	attempted to reclaim not pages actually reclaimed due to the
uncertainty  
	of how long it takes to reclaim pages. Besides targeting number of
	scanned pages for each cycle is also what the ksgxd does.

	If we target actual number of pages, sometimes it just takes too long.
I
	saw more timeouts with the default time limit when running parallel  
	selftests.

I am not sure what does "sometimes it just takes too long" mean, but what I am
thinking is you are trying to do some perfect but yet complicated code here.

For instance, I don't think selftest reflect the real workload, and I believe
adjusting the limit of a given EPC cgroup shouldn't be a frequent operation,
thus it is acceptable to use some easy-maintain code but less perfect code.

Here I still think having @nr_to_scan as a pointer is over-complicated.  For
example, we can still let sgx_reclaim_pages() to always scan SGX_NR_TO_SCAN
pages, but give up when there's enough pages reclaimed or when the EPC cgroup
and its descendants have been looped:

unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
{
	unsigned int cnt = 0;
	...

	css_for_each_descendant_pre(pos, css_root) {
		...
		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
		cnt += sgx_reclaim_pages(&epc_cg->lru);

		if (cnt >= SGX_NR_TO_SCAN)
			break;
	}

	...
	return cnt;
}

Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually reaches any
descendants, but that should be rare and we don't care that much, do we?

But I'll leave to maintainers to judge.

[1]
https://lore.kernel.org/linux-kernel/CZ3CM9ZE39Q0.222HRSEUF8RFP@xxxxxxxxxx/T/#md7b062b43d249218369f921682dfa7f975735dd1

> +
> +/*
> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup
> + * when the cgroup is at/near its maximum capacity
> + */

I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here.  Does it
make more sense to move that code change to this patch for better review?

> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	u64 cur, max;
> +
> +	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
> +
> +	for (;;) {
> +		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
> +
> +		/*
> +		 * Adjust the limit down by one page, the goal is to free up
> +		 * pages for fault allocations, not to simply obey the limit.
> +		 * Conditionally decrementing max also means the cur vs. max
> +		 * check will correctly handle the case where both are zero.
> +		 */
> +		if (max)
> +			max--;

With the below max -= SGX_NR_TO_SCAN/2 staff, do you still need this one?

> +
> +		/*
> +		 * Unless the limit is extremely low, in which case forcing
> +		 * reclaim will likely cause thrashing, force the cgroup to
> +		 * reclaim at least once if it's operating *near* its maximum
> +		 * limit by adjusting @max down by half the min reclaim size.

OK.  But why choose "SGX_NO_TO_SCAN * 2" as "extremely low"? E.g, could we
choose SGX_NR_TO_SCAN instead? 

IMHO at least we should at least put a comment to mention this.

And maybe you can have a dedicated macro for that in which way I believe the
code would be easier to understand?

> +		 * This work func is scheduled by sgx_epc_cgroup_try_charge

This has been mentioned in the function comment already.

> +		 * when it cannot directly reclaim due to being in an atomic
> +		 * context, e.g. EPC allocation in a fault handler.  
> 

Why a fault handler is an "atomic context"?  Just say when it cannot directly
reclaim.

> Waiting
> +		 * to reclaim until the cgroup is actually at its limit is less
> +		 * performant as it means the faulting task is effectively
> +		 * blocked until a worker makes its way through the global work
> +		 * queue.
> +		 */
> +		if (max > SGX_NR_TO_SCAN * 2)
> +			max -= (SGX_NR_TO_SCAN / 2);
> +
> +		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
> +
> +		if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
> +			break;
> +
> +		/* Keep reclaiming until above condition is met. */
> +		sgx_epc_cgroup_reclaim_pages(epc_cg->cg);

Also, each loop here calls sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty(), both loop the given EPC cgroup and descendants.  If
we still make sgx_reclaim_pages() always scan SGX_NR_TO_SCAN pages, seems we can
reduce the number of loops here?

> +	}
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
>   * @epc_cg:	The EPC cgroup to be charged for the page.
>   * Return:
>   * * %0 - If successfully charged.
> @@ -38,6 +209,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
>  	if (!epc_cg)
>  		return;
>  
> +	cancel_work_sync(&epc_cg->reclaim_work);
>  	kfree(epc_cg);
>  }
>  
> @@ -50,6 +222,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {
>  
>  static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
>  {
> +	sgx_lru_init(&epc_cg->lru);
> +	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
>  	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
>  	epc_cg->cg = cg;
>  }
> @@ -69,6 +243,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
>  
>  void sgx_epc_cgroup_init(void)
>  {
> +	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
> +					WQ_UNBOUND | WQ_FREEZABLE,
> +					WQ_UNBOUND_MAX_ACTIVE);
> +	BUG_ON(!sgx_epc_cg_wq);

You cannot BUG_ON() simply due to unable to allocate a workqueue.  You can use
some way to mark EPC cgroup as disabled but keep going.  Static key is one way
although we cannot re-enable it at runtime.

> +
>  	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
>  	sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
>  }
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index 6b664b4c321f..e3c6a08f0ee8 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -34,6 +34,8 @@ static inline void sgx_epc_cgroup_init(void) { }
>  #else
>  struct sgx_epc_cgroup {
>  	struct misc_cg *cg;
> +	struct sgx_epc_lru_list lru;
> +	struct work_struct reclaim_work;
>  };

So you introduced the work/workqueue here but there's no place which actually
queues the work.  IMHO you can either:

1) move relevant code change here; or
2) focus on introducing core functions to reclaim certain pages from a given EPC
cgroup w/o workqueue and introduce the work/workqueue in later patch.

Makes sense?

>  
>  static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> @@ -66,6 +68,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
>  
>  int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
>  void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);

Not sure why this needs to be exposed.  Perhaps you should make this change when
needed.

>  void sgx_epc_cgroup_init(void);
>  
>  #endif