Re: [PATCH 1/2] x86/sgx: Add accounting for tracking overcommit

Borislav Petkov <bp@xxxxxxxxx> · Mon, 20 Dec 2021 22:11:33 +0100

Bah, that thread is not on lkml. Please always Cc lkml in the future.

On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi wrote:
> If a malicious or just extra large enclave is loaded, or even just a
> lot of enclaves, it can eat up all the normal RAM on the system. Normal
> methods of finding out where all the memory on the system is being
> used, wouldn't be able to find this usage since it is shared memory. In
> addition, the OOM killer wouldn't be able to kill any enclaves.

So you need some sort of limiting against malicious enclaves anyways,
regardless of this knob. IOW, you can set a percentage limit of
per-enclave memory which cannot exceed a certain amount which won't
prevent the system from its normal operation. For example.

> I completely agree - so I'm trying to make sure I understand this
> comment, as the value is currently set to default that would
> automatically apply that is based on EPC memory present and not a fixed
> value. So I'd like to understand what you'd like to see done
> differently. are you saying you'd like to eliminated the ability to
> override the automatic default? Or just that you'd rather calculate the
> percentage based on total normal system RAM rather than EPC memory?

So you say that there are cases where swapping to normal RAM can eat
up all RAM.

So the first heuristic should be: do not allow for *all* enclaves
together to use up more than X percent of normal RAM during EPC reclaim.

X percent being, say, 90% of all RAM. For example. I guess 10% should
be enough for normal operation but someone who's more knowledgeable in
system limits could chime in here.

Then, define a per-enclave limit which says, an enclave can use Y % of
memory for swapping when trying to reclaim EPC memory. And that can be
something like:

	90 % RAM
	--------
	total amount of enclaves currently on the system

And you can obviously create scenarios where creating too many enclaves
can bring the system into a situation where it doesn't do any forward
progress.

But you probably can cause the same with overcommitting with VMs so
perhaps it would be a good idea to look how overcommitting VMs and
limits there are handled.

Bottom line is: the logic should be for the most common cases to
function properly, out-of-the-box, without knobs. And then to keep the
system operational by preventing enclaves from bringing it down to a
halt just by doing EPC reclaim.

Does that make more sense?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette