Hi Dave,
On Mon, 26 Feb 2024 08:04:54 -0600, Dave Hansen <dave.hansen@xxxxxxxxx>
wrote:
On 2/26/24 03:36, Huang, Kai wrote:
In case of overcommitting, even if we always reclaim from the same cgroup
for each fault, one group may still interfere with the other: e.g., consider
an extreme case in which group A has used up almost all EPC at the time
group B has a fault; B then has to fail allocation and kill enclaves.
If the admin allows group A to use almost all EPC, to me it's fair to say
he/she doesn't want to run anything inside B at all, and it is acceptable
for enclaves in B to be killed.
Folks, I'm having a really hard time following this thread. It sounds
like there's disagreement about when to do system-wide reclaim. Could
someone remind me of the choices that we have? (A proposed patch would
go a _long_ way to helping me understand)
In case of overcommitting, i.e., the sum of limits is greater than the EPC
capacity, suppose one group has a fault and its usage is not above its own
limit (try_charge() passes), yet total usage of the system has exceeded
the capacity. The choice is whether we do global reclaim or just reclaim
pages in the current faulting group.
Also, what does the core mm memcg code do?
I'm not sure. I'll try to find out, but I'd appreciate it if someone more
knowledgeable could comment on this. memcg also has a protection mechanism
(i.e., the min and low settings) to guarantee some allocation per group, so
its approach might not be applicable to the misc controller here.
Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?
I still think the current approach of doing global reclaim is reasonable
and simple: try_charge() checks the cgroup limit and reclaims within the
group if needed, then we do the EPC page allocation and reclaim globally if
the allocation fails because global usage has reached the capacity.
I'm not sure how skipping global reclaim in this case would bring any
benefit. Please see my response to Kai's example cases.
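
To make the two choices concrete, here is a rough userspace sketch, not
kernel code and not taken from the patch series. Only try_charge() is a name
from this thread; the structs, the policy enum, alloc_epc_page() and the
"reclaim from the largest group" victim selection are hypothetical
simplifications, and the real reclaimer selects pages differently. It models
the flow described above: charge against the group limit first, then
allocate, and reclaim either globally or only from the faulting group when
the EPC is full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: every name except try_charge() is hypothetical. */
struct epc_cgroup {
	long limit;	/* per-group EPC limit, in pages */
	long usage;	/* pages currently held by this group */
};

struct epc_system {
	long capacity;			/* total EPC pages in the system */
	struct epc_cgroup *groups;
	size_t nr_groups;		/* must be at least 1 */
};

enum reclaim_policy { RECLAIM_GLOBAL, RECLAIM_CURRENT_GROUP };

static long total_usage(const struct epc_system *sys)
{
	long sum = 0;

	for (size_t i = 0; i < sys->nr_groups; i++)
		sum += sys->groups[i].usage;
	return sum;
}

/* Free one page from a group; fails if the group holds no pages. */
static bool reclaim_one(struct epc_cgroup *cg)
{
	if (cg->usage == 0)
		return false;
	cg->usage--;
	return true;
}

/* Assumed victim selection for global reclaim: the largest user. */
static struct epc_cgroup *largest_group(struct epc_system *sys)
{
	struct epc_cgroup *victim = &sys->groups[0];

	for (size_t i = 1; i < sys->nr_groups; i++)
		if (sys->groups[i].usage > victim->usage)
			victim = &sys->groups[i];
	return victim;
}

/* Step 1: enforce the group limit, reclaiming within the group if needed. */
static bool try_charge(struct epc_cgroup *cg)
{
	while (cg->usage >= cg->limit)
		if (!reclaim_one(cg))
			return false;
	return true;
}

/* Step 2: allocate a page; if the EPC is full, reclaim per the policy. */
static bool alloc_epc_page(struct epc_system *sys, size_t idx,
			   enum reclaim_policy policy)
{
	struct epc_cgroup *cg = &sys->groups[idx];

	if (!try_charge(cg))
		return false;

	if (total_usage(sys) >= sys->capacity) {
		struct epc_cgroup *victim =
			policy == RECLAIM_GLOBAL ? largest_group(sys) : cg;

		if (!reclaim_one(victim))
			return false;	/* caller would have to kill the enclave */
	}

	cg->usage++;
	assert(cg->usage <= cg->limit);
	assert(total_usage(sys) <= sys->capacity);
	return true;
}
```

With a capacity of 100 pages, group A at limit 100 and usage 100, and group
B at limit 50 and usage 0, a fault in B fails under RECLAIM_CURRENT_GROUP
because B has nothing of its own to reclaim. Under RECLAIM_GLOBAL, one page
is taken from A and B's allocation succeeds.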
Thanks
Haitao