On Mon, 22 Mar 2021 12:37:02 -0700 Sean Christopherson wrote:
> On Mon, Mar 22, 2021, Borislav Petkov wrote:
> > On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> > > Not necessarily. This can only trigger in the host, and thus require a host
> > > reboot, if the host is also running enclaves. If the CSP is not running
> > > enclaves, or is running its enclaves in a separate VM, then this path cannot be
> > > reached.
> >
> > That's what I meant. Rebooting guests is a lot easier, ofc.
> >
> > Or are you saying, this can trigger *only* when they're running enclaves
> > on the *host* too?
>
> Yes. Note, it's still true if you strike out the "too", KVM support is completely
> orthogonal to this code. The purpose of this patch is to separate out the EREMOVE
> path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
> KVM will have non-buggy scenarios where EREMOVE can fail. But the virt EPC code
> is designed to handle that gracefully.
>
> > > EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> > > running as a guest).
> >
> > We get those on a daily basis.
> >
> > > IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> > > failure was also quite fatal to SGX, i.e. this is just the canary in
> > > the coal mine.
> > >
> > > It's certainly possible to add more sophisticated error handling, e.g. throw
> > > the pages onto a list and periodically try to recover them. But, since the vast
> > > majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> > > sophisticated handling is quite low on the list of priorities.
> > >
> > > Dave wanted the "page leaked" error message so that it's abundantly clear that
> > > the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".
> >
> > So this sounds to me like this should BUG too eventually.
> >
> > Or is this one of those "this should never happen" things so no one
> > should worry?
>
> Hmm.
> I don't think it warrants BUG. At worst, leaking EPC pages is fatal only
> to SGX. If the underlying bug caused other fallout, e.g. didn't release a lock,
> then obviously that could be fatal to the kernel. But I don't think there's
> ever a case where SGX being unusable would prevent the kernel from functioning.
>
> > Whatever it is, if an admin sees this message in dmesg and doesn't get a
> > lengthy explanation what she/he is supposed to do, I don't think she/he
> > will be as relaxed.
> >
> > Hell, people open bugs for correctable ECCs and are asking whether they
> > need to replace their hardware.
>
> LOL.
>
> > So let's play this out: put yourself in an admin's shoes and tell me how
> > should an admin react when she/he sees that?
> >
> > Should the kernel probably also say: "Don't worry, you have enough
> > memory and what's a 4K, who cares? You'll reboot eventually."
> >
> > Or should the kernel say "You need to reboot ASAP."
> >
> > And so on...
> >
> > So what is the scenario here and what kind of reaction is that message
> > supposed to cause, recovery action, blabla, the whole spiel?
>
> Probably something in between. Odds are good SGX will eventually become
> unusable, e.g. either kernel SGX support is completely hosed, or it will soon
> leak the majority of EPC pages. Something like this?
>
>   "EREMOVE returned %d (0x%x), kernel bug likely. EPC page leaked, SGX may become unusable. Reboot recommended to continue using SGX."

Or perhaps just stick to the old behavior in the original sgx_free_epc_page()?

	ret = __eremove(sgx_get_epc_virt_addr(page));
	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
		return;

This code path is only used by the host SGX driver, not by KVM. And this patch's
*main* intention is to break EREMOVE out of sgx_free_epc_page() so virtual EPC
code can use sgx_free_epc_page().
Improving the error message can be a separate discussion and a separate patch
in the future; it has nothing to do with SGX virtualization support.
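
To make the intended split concrete, here is a rough user-space sketch. It is
not the actual patch and not kernel code: struct epc_page, mock_eremove(),
mock_free_epc_page(), mock_encl_free_epc_page() and the two counters are
hypothetical stand-ins, and the one-shot fprintf() only imitates what
WARN_ONCE() would do. The point is just that the plain free helper no longer
runs EREMOVE, while the host enclave path runs it first and leaks the page on
failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an EPC page; not the kernel's struct. */
struct epc_page {
	int eremove_result;	/* value the mock EREMOVE returns for this page */
	bool on_free_list;	/* true once the page is back in the allocator */
};

static int warned_once;		/* imitates WARN_ONCE()'s one-shot behaviour */
static int leaked_pages;	/* pages dropped because EREMOVE failed */

/* Mock of the EREMOVE wrapper: 0 means success, non-zero means failure. */
static int mock_eremove(struct epc_page *page)
{
	return page->eremove_result;
}

/* Plain free: no EREMOVE here, so virtual EPC code could call it directly. */
void mock_free_epc_page(struct epc_page *page)
{
	assert(page != NULL);
	page->on_free_list = true;
}

/* Host enclave path: EREMOVE first, warn once and leak the page on failure. */
void mock_encl_free_epc_page(struct epc_page *page)
{
	int ret;

	assert(page != NULL);
	ret = mock_eremove(page);
	if (ret) {
		if (!warned_once) {
			warned_once = 1;
			fprintf(stderr, "EREMOVE returned %d (0x%x)\n",
				ret, (unsigned int)ret);
		}
		leaked_pages++;
		return;		/* page is intentionally not freed */
	}
	mock_free_epc_page(page);
}
```

With this shape, a caller on the host enclave side would use
mock_encl_free_epc_page(), and a caller that handles EREMOVE failure itself
(as the virt EPC code is said to) would use mock_free_epc_page() only.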