Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

Borislav Petkov <bp@xxxxxxxxx> · Mon, 22 Mar 2021 20:15:40 +0100

On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> Not necessarily.  This can only trigger in the host, and thus require a host
> reboot, if the host is also running enclaves.  If the CSP is not running
> enclaves, or is running its enclaves in a separate VM, then this path cannot be
> reached.

That's what I meant. Rebooting guests is a lot easier, ofc.

Or are you saying, this can trigger *only* when they're running enclaves
on the *host* too?

> EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> running as a guest). 

We get those on a daily basis.

> IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> failure was also quite fatal to SGX, i.e. this is just the canary in
> the coal mine.
>
> It's certainly possible to add more sophisticated error handling, e.g. throw
> the pages onto a list and periodically try to recover them.  But, since the vast
> majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> sophisticated handling is quite low on the list of priorities.
> 
> Dave wanted the "page leaked" error message so that it's abundantly clear that
> the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".
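
For reference, the "throw the pages onto a list and periodically try to
recover them" idea could look roughly like the userspace sketch below.
This is only an illustration under stated assumptions: struct epc_page,
struct leaked_list, try_remove(), free_or_park() and retry_leaked() are
all made-up names, try_remove() merely simulates an EREMOVE-style
operation that can fail, and none of it is the kernel's SGX code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an EPC page; not the kernel's struct. */
struct epc_page {
	bool stuck;		/* simulated: removal fails while this is set */
	struct epc_page *next;	/* link on the leaked list */
};

/* Pages whose removal failed and which are waiting for a retry. */
struct leaked_list {
	struct epc_page *head;
	size_t count;
};

/* Simulated removal: returns 0 on success, nonzero on failure. */
static int try_remove(struct epc_page *page)
{
	return page->stuck ? -1 : 0;
}

/*
 * Free path: returns true if the page could be removed right away.
 * Otherwise the page is parked on the leaked list instead of being lost.
 */
static bool free_or_park(struct leaked_list *list, struct epc_page *page)
{
	if (!try_remove(page))
		return true;

	page->next = list->head;
	list->head = page;
	list->count++;
	return false;
}

/*
 * Periodic retry: walk the leaked list, unlink every page whose removal
 * now succeeds, and return how many pages were recovered.
 */
static size_t retry_leaked(struct leaked_list *list)
{
	struct epc_page **link = &list->head;
	size_t recovered = 0;

	while (*link) {
		struct epc_page *page = *link;

		if (!try_remove(page)) {
			assert(list->count > 0);
			*link = page->next;	/* unlink the recovered page */
			page->next = NULL;
			list->count--;
			recovered++;
		} else {
			link = &page->next;	/* still stuck, keep it parked */
		}
	}
	return recovered;
}
```

A caller would use free_or_park() wherever a page is released and run
retry_leaked() from some periodic context. Whether a retry can ever
succeed after the kind of bug that makes EREMOVE fail is exactly the
open question in this thread, so the sketch shows the bookkeeping only.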

So this sounds to me like this should BUG too eventually.

Or is this one of those "this should never happen" things so no one
should worry?

Whatever it is, if an admin sees this message in dmesg and doesn't get a
lengthy explanation of what she/he is supposed to do, I don't think she/he
will be as relaxed.

Hell, people open bugs for correctable ECCs and are asking whether they
need to replace their hardware.

So let's play this out: put yourself in an admin's shoes and tell me how
an admin should react when she/he sees that?

Should the kernel probably also say: "Don't worry, you have enough
memory and what's a 4K, who cares? You'll reboot eventually."

Or should the kernel say "You need to reboot ASAP."

And so on...

So what is the scenario here and what kind of reaction is that message
supposed to cause, recovery action, blabla, the whole spiel?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette