Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

"Haitao Huang" <haitao.huang@xxxxxxxxxxxxxxx> · Thu, 22 Feb 2024 12:09:28 -0600




On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <kai.huang@xxxxxxxxx> wrote:

+/**
+ * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs  
to reclaim pages
+ * @root:	Root of the tree to start walking from.
+ * Return:	Number of pages reclaimed.

Just wondering, do you need to return @cnt given this function is called  
w/o
checking the return value?

Yes. Will add explicit commenting that we need scan fixed number of pages  
for attempted reclamation.
+ */
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
+{
+	/*
+	 * Attempting to reclaim only a few pages will often fail and is
+	 * inefficient, while reclaiming a huge number of pages can result in
+	 * soft lockups due to holding various locks for an extended duration.
+	 */

Not sure we need this comment, given it's already implied in
sgx_reclaim_pages().  You cannot pass a value > SGX_NR_TO_SCAN anyway.

Will rework on these comments to make them more meaningful.


[other comments/questions addressed in separate email threads]
[...]
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the  
cgroup
+ * when the cgroup is at/near its maximum capacity
+ */

I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here.   
Does it
make more sense to move that code change to this patch for better review?


Right. This comment was left-over when I split the old patch.

+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	u64 cur, max;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	for (;;) {
+		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+		/*
+		 * Adjust the limit down by one page, the goal is to free up
+		 * pages for fault allocations, not to simply obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;

With the below max -= SGX_NR_TO_SCAN/2 staff, do you still need this one?


Logically still needed for case max <= SGX_NR_TO_SCAN * 2

+
+		/*
+		 * Unless the limit is extremely low, in which case forcing
+		 * reclaim will likely cause thrashing, force the cgroup to
+		 * reclaim at least once if it's operating *near* its maximum
+		 * limit by adjusting @max down by half the min reclaim size.

OK.  But why choose "SGX_NO_TO_SCAN * 2" as "extremely low"? E.g, could  
we
choose SGX_NR_TO_SCAN instead?
IMHO at least we should at least put a comment to mention this.

And maybe you can have a dedicated macro for that in which way I believe  
the
code would be easier to understand?

Good point. I think the value is kind of arbitrary. We consider  
enclaves/cgroups of 64K size are very small. If such a cgroup ever reaches  
the limit, then we don't aggressively reclaim to optimize #PF handling.  
User might as well just raise the limit if it is not performant.


+		 * This work func is scheduled by sgx_epc_cgroup_try_charge

This has been mentioned in the function comment already.

+		 * when it cannot directly reclaim due to being in an atomic
+		 * context, e.g. EPC allocation in a fault handler.

Why a fault handler is an "atomic context"?  Just say when it cannot  
directly
reclaim.


Sure.

Waiting
+		 * to reclaim until the cgroup is actually at its limit is less
+		 * performant as it means the faulting task is effectively
+		 * blocked until a worker makes its way through the global work
+		 * queue.
+		 */
+		if (max > SGX_NR_TO_SCAN * 2)
+			max -= (SGX_NR_TO_SCAN / 2);
+
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+
+		if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
+			break;
+
+		/* Keep reclaiming until above condition is met. */
+		sgx_epc_cgroup_reclaim_pages(epc_cg->cg);

Also, each loop here calls sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty(), both loop the given EPC cgroup and  
descendants.  If
we still make sgx_reclaim_pages() always scan SGX_NR_TO_SCAN pages,  
seems we can
reduce the number of loops here?


[We already scan SGX_NR_TO_SCAN pages for the cgroup at the level of  
sgx_epc_cgroup_reclaim_pages().]

I think you mean that we keep scanning and reclaiming until at least  
SGX_NR_TO_SCAN pages are reclaimed as your code suggested above. We  
probably can make that a version for this background thread for  
optimization. But sgx_epc_cgroup_max_pages_to_root() and  
sgx_epc_cgroup_lru_empty() are not that bad unless we had very deep and  
wide cgroup trees. So would you agree we defer this optimization for later?


+	}
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC  
page
  * @epc_cg:	The EPC cgroup to be charged for the page.
  * Return:
  * * %0 - If successfully charged.
@@ -38,6 +209,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
 	if (!epc_cg)
 		return;

+	cancel_work_sync(&epc_cg->reclaim_work);
 	kfree(epc_cg);
 }

@@ -50,6 +222,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {

 static void sgx_epc_misc_init(struct misc_cg *cg, struct  
sgx_epc_cgroup *epc_cg)
 {
+	sgx_lru_init(&epc_cg->lru);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
 	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
 	epc_cg->cg = cg;
 }
@@ -69,6 +243,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)

 void sgx_epc_cgroup_init(void)
 {
+	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!sgx_epc_cg_wq);

You cannot BUG_ON() simply due to unable to allocate a workqueue.  You  
can use
some way to mark EPC cgroup as disabled but keep going.  Static key is  
one way
although we cannot re-enable it at runtime.


Okay, I'll disable and print a log.

[...]
[workqueue related discussion in separate email]

Thanks
Haitao