Re: [PATCH v15 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

Hi Kai

On Thu, 20 Jun 2024 05:30:16 -0500, Huang, Kai <kai.huang@xxxxxxxxx> wrote:


On 18/06/2024 12:53 am, Huang, Haitao wrote:
From: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx>

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_epc_page_lru() always
returns reference to the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs, and the global
reclaimer (ksgxd) will no longer be able to reclaim any pages from the
global LRU. However, in cases of overcommitting, i.e., when the sum of
cgroup limits is greater than the total capacity, cgroups may never
reclaim on their own while the total usage can still be near the
capacity. Therefore, global reclamation is still needed in those cases,
and it should be performed from the root cgroup.

Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
when cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(), return
the next cgroup so callers can use it as the new starting node for the
next round of reclamation if needed.

Also update sgx_can_reclaim_global() to check the emptiness of the LRUs
of all cgroups when the EPC cgroup is enabled; otherwise, only check the
global LRU.

Finally, change sgx_reclaim_direct() to check and ensure there are free
pages at the cgroup level, so forward progress can be made by the caller.
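The lookup change at the heart of this patch can be illustrated with a small stand-alone C model (a toy, not the kernel code; `epc_page_lru`, `cgroups_enabled`, and the struct names are illustrative stand-ins for the helpers the series introduces):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy LRU list head and an EPC cgroup that owns one. */
struct lru_list { int nr_pages; };
struct epc_cgroup { struct lru_list lru; };

/* One global LRU, used when cgroup support is compiled out. */
static struct lru_list global_lru;

/* Stand-in for IS_ENABLED(CONFIG_CGROUP_MISC). */
static bool cgroups_enabled = true;

/* Each EPC page records the cgroup it was charged to at allocation. */
struct epc_page { struct epc_cgroup *cg; };

/*
 * Before the patch the lookup always returned &global_lru; afterwards it
 * returns the LRU of the owning cgroup, so every page is tracked per
 * cgroup and nothing is left on the global list for ksgxd to find.
 */
static struct lru_list *epc_page_lru(struct epc_page *page)
{
	if (cgroups_enabled && page->cg)
		return &page->cg->lru;
	return &global_lru;
}
```

With cgroups enabled, a page allocated while a cgroup is current lands on that cgroup's LRU; with cgroups compiled out, the function degenerates to the old global-LRU behavior.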

Reading above, it's not clear how the _new_ global reclaim works with
multiple LRUs.

E.g., the current global reclaim essentially treats all EPC pages equally
when scanning those pages.  From the above, I don't see how this is
achieved in the new global reclaim.

The changelog should:

1) describe how the existing global reclaim works, and then describe
how the same behaviour is achieved in the new global reclaim, which
works with multiple LRUs;

2) If there's any behaviour difference between the "existing" and the
"new" global reclaim, the changelog should point out the difference and
explain why such a difference is OK.

Sure, I can explain. Here is what I plan to add for v16:

Note the original global reclaimer has only one LRU and always scans
and reclaims from the head of this global LRU. The new global reclaimer
always starts scanning from the root node, and only moves down to its
descendants if more reclamation is needed or the root node does not have
SGX_NR_TO_SCAN (16) pages in its LRU. This makes enclave pages in the
root node more likely to be reclaimed if they are not frequently used
(not 'young'). Unless we track pages in one LRU again, we cannot exactly
match the behavior of the original global reclaimer, and a one-LRU
approach would make per-cgroup scanning and reclaiming too complex. On
the other hand, this design is acceptable for the following reasons:

1) For all purposes of using cgroups, enclaves will need to live in
     non-root (leaf for cgroup v2) nodes where limits can be enforced
     per cgroup.
2) Global reclamation now only happens in the situation mentioned above,
     when a lower-level cgroup not at its limit cannot allocate due to
     overcommit at the global level.
3) The pages in the root being slightly penalized are not busily used
     anyway.
4) In cases where multiple rounds of reclamation are needed, the caller
     of sgx_reclaim_pages_global() can still call it again to reclaim
     from the 'next' descendant in a round-robin way; each round scans
     SGX_NR_TO_SCAN pages from the head of the 'next' cgroup's LRU.
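The scan order described in the four points above can be sketched as a stand-alone C model (a toy over a flattened node array, not the kernel implementation; `reclaim_round`, `toy_cg`, and the walk order are made-up simplifications):

```c
#include <assert.h>

#define NR_TO_SCAN 16	/* mirrors SGX_NR_TO_SCAN from the text */

/* Toy cgroup hierarchy flattened in walk order; index 0 is the root. */
struct toy_cg { int reclaimable; };

/*
 * Model of the scan order: start at @start (the root on the first round,
 * or the 'next' cursor a previous round returned), take up to NR_TO_SCAN
 * pages, moving on to the following node only when the current one
 * cannot satisfy the budget, and return the cursor where the next round
 * should resume.
 */
static int reclaim_round(struct toy_cg *cgs, int nr_cgs, int start, int *reclaimed)
{
	int budget = NR_TO_SCAN;
	int i = start;

	*reclaimed = 0;
	while (budget > 0) {
		int take = cgs[i].reclaimable < budget ? cgs[i].reclaimable : budget;

		cgs[i].reclaimable -= take;
		*reclaimed += take;
		budget -= take;
		i = (i + 1) % nr_cgs;	/* advance in round-robin order */
		if (i == start)
			break;		/* full cycle, nothing left to take */
	}
	return i;			/* 'next' cursor for the caller */
}
```

This shows both properties at issue in the review: a round always drains the starting node first (so root pages are scanned preferentially when every round starts at the root), and a caller that feeds the returned cursor back in gets round-robin coverage of the descendants.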




With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.


[...]


-static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
+static struct misc_cg *sgx_reclaim_pages_global(struct misc_cg *next_cg,
+						struct mm_struct *charge_mm)
  {
+	if (IS_ENABLED(CONFIG_CGROUP_MISC))
+		return sgx_cgroup_reclaim_pages(misc_cg_root(), next_cg, charge_mm);
+
  	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+	return NULL;
  }

  /*
@@ -414,12 +443,35 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
   */
  void sgx_reclaim_direct(void)
  {
+	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+	struct misc_cg *cg = misc_from_sgx(sgx_cg);

From below, @sgx_cg could be NULL. It's not immediately clear whether
calling misc_from_sgx(sgx_cg) unconditionally is safe here.

Leave the initialization of @cg to a later phase where @sgx_cg is
guaranteed not to be NULL, or initialize @cg to NULL here and update it
later.


Ok

+	struct misc_cg *next_cg = NULL;
+
+	/*
+	 * Make sure there are some free pages at both cgroup and global levels.
+	 * In both cases, only make one attempt of reclamation to avoid lengthy
+	 * block on the caller.
+	 */
+	if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
+		next_cg = sgx_cgroup_reclaim_pages(cg, next_cg, current->mm);

I don't quite follow the logic.

First of all, sgx_cgroup_reclaim_pages() isn't called in a loop, so why
not just do:

	next_cg = sgx_cgroup_reclaim_pages(cg, NULL, current->mm);


Okay, I can replace it with NULL. I thought it was convenient to grep all such calls to see if they are used consistently.

And what is the point of set @next_cg here, since ...


+
+	if (next_cg != cg)
+		put_misc_cg(next_cg);
+
+	next_cg = NULL;

... here @next_cg is reset to NULL ?

Looks the only reason is you need to do ...

	put_misc_cg(next_cg);

... above?


yes

This piece of code appears repeatedly in this file.  Is there any way we
can get rid of it?


sgx_cgroup_reclaim_pages() returns 'next' in case the caller needs to run it in a loop. In this case, no.

I could create a thin wrapper for this case, but since there is only one such case, I did not feel it's worth the extra layer.

  	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages_global(current->mm);
+		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);

And this doesn't seem like "global reclaim" at all?

Because it essentially equals to:

	next_cg = sgx_reclaim_pages_global(NULL, current->mm);

which always reclaims from the ROOT.

So each call to sgx_reclaim_direct() will always reclaim from the ROOT --
any other LRUs in the hierarchy are unlikely to get any chance to be
reclaimed.

No, the current cgroup is always checked and asked to reclaim if needed.
Global reclaim only happens when the system-wide watermark
SGX_NR_LOW_PAGES is violated.
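The ordering being argued here can be sketched in stand-alone C (a toy model of the control flow only; the thresholds, flag names, and `reclaim_direct` are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state and watermarks; names echo the text, values are made up. */
#define CG_LOW		8	/* per-cgroup "should reclaim" threshold */
#define NR_LOW_PAGES	32	/* stand-in for SGX_NR_LOW_PAGES */

static int cg_free_pages;
static int global_free_pages;
static bool ran_cgroup_reclaim;
static bool ran_global_reclaim;

/*
 * Ordering sketch of sgx_reclaim_direct() as described in the reply: the
 * current cgroup is always checked first, and the global (root-based)
 * reclaim runs only when the system-wide low watermark is violated.
 */
static void reclaim_direct(void)
{
	ran_cgroup_reclaim = false;
	ran_global_reclaim = false;

	if (cg_free_pages < CG_LOW)
		ran_cgroup_reclaim = true;	/* per-cgroup attempt */

	if (global_free_pages < NR_LOW_PAGES)
		ran_global_reclaim = true;	/* root-based global attempt */
}
```

The point of the model: the root-based pass is gated on the global watermark, so it is not taken on every call, while the caller's own cgroup is considered every time.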


+
+	if (next_cg != misc_cg_root())
+		put_misc_cg(next_cg);
+
+	sgx_put_cg(sgx_cg);
  }

  static int ksgxd(void *p)
  {
+	struct misc_cg *next_cg = NULL;
+
  	set_freezable();

  	/*
@@ -437,11 +489,15 @@ static int ksgxd(void *p)
  				     kthread_should_stop() ||
  				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));

-		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
+		while (!kthread_should_stop() &&
+		       sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) {
  			/* Indirect reclaim, no mm to charge, so NULL: */
-			sgx_reclaim_pages_global(NULL);
+			next_cg = sgx_reclaim_pages_global(next_cg, NULL);
+			cond_resched();

Should the 'put_misc_cg(next_cg)' be done within the while() loop but not
below?

No, whenever there is a loop, we want to reclaim in a round-robin way.

+		}

-		cond_resched();
+		if (next_cg != misc_cg_root())
+			put_misc_cg(next_cg);
+		next_cg = NULL;

Again, this doesn't seem like "global reclaim" here, since you always
restart from the ROOT once the target pages have been reclaimed.

AFAICT it's completely different from the existing global reclaim.

Hopefully adding the comments mentioned above clarifies this.

  	}

  	return 0;
@@ -583,6 +639,7 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
   */
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
  {
+	struct misc_cg *next_cg = NULL;
  	struct sgx_cgroup *sgx_cg;
  	struct sgx_epc_page *page;
  	int ret;
@@ -616,10 +673,19 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
  			break;
  		}

-		sgx_reclaim_pages_global(current->mm);
+		/*
+		 * At this point, the usage within this cgroup is under its
+		 * limit but there is no physical page left for allocation.
+		 * Perform a global reclaim to get some pages released from any
+		 * cgroup with reclaimable pages.
+		 */
+		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
  		cond_resched();
  	}

Ditto IIUC.





