On Tue, Mar 23, 2021, Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 04:21:47PM +0000, Sean Christopherson wrote:
> > I like the idea of pointing at the documentation.  The documentation should
> > probably emphasize that something is very, very wrong.
>
> Yap, because no matter how we formulate the error message, it still ain't enough
> and needs a longer explanation.
>
> > E.g. if a kernel bug triggers EREMOVE failure and isn't detected until
> > the kernel is widely deployed in a fleet, then the folks deploying the
> > kernel probably _should_ be in all out panic.  For this variety of bug
> > to escape that far, it means there are huge holes in test coverage, in
> > both the kernel itself and in the infrastructure of whoever is rolling
> > out their new kernel.
>
> You sound just like someone who works at a company with a big fleet, oh
> wait...
>
> :-)
>
> And yap, you big fleeted guys will more likely catch it but we do have
> all these other customers who have a handful of servers only so they
> probably won't be able to do such a wide coverage.

The size of the fleet shouldn't matter for this specific case.  This bug
requires the _host_ to be running enclaves, and obviously it also requires
the system to be running SGX-enabled guests as well.  In such a setup, the
SGX workload running in the host should be very well defined and understood,
i.e. testing should be a well-bounded problem to solve.

Running enclaves in both the host and guest should be uncommon in and of
itself, and for such setups, running _any_ SGX workloads in the host, let
alone more than 1 or 2 unique workloads, without ensuring guests are fully
isolated is, IMO, insane.

But yeah, what can happen, will happen.

> So I hope they'll appreciate this longer explanation about what to do
> when they hit it. And normally I wouldn't even care but we almost never
> tell people to reboot their boxes to fix sh*t - that's the other OS.
>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette