Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

Kai Huang <kai.huang@xxxxxxxxx> · Tue, 23 Mar 2021 11:06:43 +1300

On Mon, 22 Mar 2021 22:06:45 +0100 Borislav Petkov wrote:
> On Mon, Mar 22, 2021 at 12:37:02PM -0700, Sean Christopherson wrote:
> > Yes.  Note, it's still true if you strike out the "too", KVM support is completely
> > orthogonal to this code.  The purpose of this patch is to separate out the EREMOVE
> > path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
> > KVM will have non-buggy scenarios where EREMOVE can fail.  But the virt EPC code
> > is designed to handle that gracefully.
> 
> "gracefully" as it won't leak EPC pages which would require a host reboot? That
> leaking is done by host enclaves only?

This path is called by host SGX driver only, so yes this leaking is done by
host enclaves only.

This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so virtual
EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently.
This patch doesn't have intention to bring functional change.  I changed the
error msg because Dave said it worth to give a message saying EPC page is
leaked, and I thought this minor change won't break anything.

Perpahps we can avoid changing error message but stick to existing SGX driver
behavior? 

> 
> > Hmm.  I don't think it warrants BUG.  At worst, leaking EPC pages is fatal only
> > to SGX.
> 
> Fatal how? If it keeps leaking, at some point it won't have any pages
> for EPC pages anymore?
> 
> Btw, I probably have seen this and forgotten again so pls remind me,
> is the amount of pages available for SGX use static and limited by,
> I believe BIOS, or can a leakage in EPC pages cause system memory
> shortage?
> 
> > If the underlying bug caused other fallout, e.g. didn't release a
> > lock, then obviously that could be fatal to the kernel. But I don't
> > think there's ever a case where SGX being unusuable would prevent the
> > kernel from functioning.
> 
> This kinda replies my question above but still...
> 
> > Probably something in between.  Odds are good SGX will eventually become
> > unusuable, e.g. either kernel SGX support is completely hosted, or it will soon
> > leak the majority of EPC pages.  Something like this?
> > 
> >   "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become unusuable.  Reboot recommended to continue using SGX."
> 
> So all this handwaving I'm doing is to provoke a proper response from
> you guys as to how a EPC page leaking is supposed to be handled by the
> users of the technology:
> 
> 1. Issue a warning message and forget about it, eventual reboot

This is the existing SGX driver behavior IMHO. It just gives a WARN() saying
EREMOVE failed with the error code printed.  The EPC page is leaked w/o any
message to user.  I can live with it, and it is existing code anyway.

btw, currently virtual EPC code (patch 5) handles in similar way: There's one
EREMOVE error is expected and virtual EPC code can handle, but for other
errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC
page is leaked too:

+		WARN_ONCE(ret != SGX_CHILD_PRESENT,
+			  "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
+			  sgx_get_epc_phys_addr(epc_page), ret);
+		return ret;

So to me they are just WARN() to catch kernel bug.

> 
> 2. Really scary message to make users reboot sooner
> 
> 3. Detect when host enclaves are run while guest enclaves are running
> and issue a warning then.

This code path has nothing to do with guest enclaves.

> 
> 4. Fall on knees and pray to not get sued by customers because their
> enclaves are not working anymore.
> 
> ....
> 
> Btw, 4. needs to be considered properly so that people can cover asses.

If we are talking about CSPs being unable to provide correct services to
customers due to kernel bug, I think this is a bigger question but not
just related to SGX,  since other kernel bug can also cause similar problem, for
instance, VM or SGX process itself being killed.

> 
> Oh and whatever we end up deciding, we should document that in
> Documentation/... somewhere and point users to it in that warning
> message where a longer treatise is explaining the whole deal properly.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette