Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

"Haitao Huang" <haitao.huang@xxxxxxxxxxxxxxx> · Mon, 26 Feb 2024 15:48:18 -0600

Hi Dave,

On Mon, 26 Feb 2024 08:04:54 -0600, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:

On 2/26/24 03:36, Huang, Kai wrote:
In case of overcommitting, even if we always reclaim from the same cgroup
for each fault, one group may still interfere with the other: e.g.,
consider an extreme case in which group A has used up almost all EPC at
the time group B has a fault; B then has to fail allocation and kill
enclaves.
If the admin allows group A to use almost all EPC, to me it's fair to say
he/she doesn't want to run anything inside B at all, and it is acceptable
for enclaves in B to be killed.

Folks, I'm having a really hard time following this thread.  It sounds
like there's disagreement about when to do system-wide reclaim.  Could
someone remind me of the choices that we have?  (A proposed patch would
go a _long_ way to helping me understand)

The question is what to do in case of overcommitting, i.e., when the sum
of limits is greater than the EPC capacity: if one group has a fault and
its usage is not above its own limit (try_charge() passes), yet total
usage of the system has reached the capacity, do we do global reclaim or
just reclaim pages in the current faulting group?
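
To make the disputed case concrete, here is a toy model in Python (not
kernel code). The capacity, limits, usage numbers, and function names are
all made up for illustration; only try_charge() comes from the patch.

```python
# Toy numbers, not taken from the patch series.
EPC_CAPACITY = 100            # total EPC pages in the system (hypothetical)
limits = {"A": 80, "B": 60}   # per-cgroup limits; 80 + 60 > 100
usage = {"A": 80, "B": 20}    # current per-cgroup usage; sums to capacity


def overcommitted(limits, capacity):
    """True when the sum of per-cgroup limits exceeds the EPC capacity."""
    return sum(limits.values()) > capacity


def fault_state(group, usage, limits, capacity):
    """Classify a one-page fault in `group` (hypothetical helper)."""
    within_limit = usage[group] + 1 <= limits[group]
    epc_free = sum(usage.values()) < capacity
    if not within_limit:
        # try_charge() would have to reclaim inside the group first.
        return "reclaim-in-group"
    if epc_free:
        return "allocate"
    # Group is under its limit, but the system has no free EPC page:
    # this is the case the thread is about.
    return "global-or-group-reclaim"
```

With these numbers a fault in B lands in the last branch: B is well under
its limit, but all EPC is in use, mostly by A.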

Also, what does the core mm memcg code do?

I'm not sure. I'll try to find out, but it'd be appreciated if someone
more knowledgeable could comment on this. memcg also has a protection
mechanism (i.e., the min and low settings) to guarantee some allocation
per group, so its approach might not be applicable to the misc controller
here.

Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?

I still think the current approach of doing global reclaim is reasonable
and simple: try_charge() checks the cgroup limit and reclaims within the
group if needed, then we do the EPC page allocation and reclaim globally
if the allocation fails because global usage has reached the capacity.
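
As a rough illustration of that flow, here is a minimal Python sketch. It
is a toy model, not the patch: the Cgroup class, reclaim_one(), and
alloc_epc_page() are hypothetical names, and "evict from the group with
the highest usage" stands in for the real LRU-based reclaim.

```python
class Cgroup:
    """Toy EPC cgroup: a limit and a usage counter, both in pages."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.usage = 0


def reclaim_one(groups):
    """Evict one page from the group with the highest usage (toy policy).

    Returns the victim group, or None if nothing could be reclaimed.
    """
    victim = max(groups, key=lambda g: g.usage)
    if victim.usage == 0:
        return None
    victim.usage -= 1
    return victim


def alloc_epc_page(group, all_groups, capacity):
    """Charge and allocate one page for `group`; True on success."""
    # Step 1, the try_charge() part: enforce the per-cgroup limit,
    # reclaiming within the faulting group if it is already at its limit.
    if group.usage >= group.limit:
        if reclaim_one([group]) is None:
            return False
    # Step 2, the allocation part: if global usage has reached the
    # capacity, reclaim globally (from any group) before allocating.
    if sum(g.usage for g in all_groups) >= capacity:
        if reclaim_one(all_groups) is None:
            return False
    group.usage += 1
    return True
```

In this model the group-only alternative would amount to passing [group]
instead of all_groups in step 2, so the two options differ only in which
set of pages the second reclaim is allowed to pick from.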

I'm not sure what benefit we would get from not doing global reclaim in
this case. Please see my response to Kai's example cases.

Thanks
Haitao