On Sat, Nov 13, 2021 at 10:35 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Nov 12, 2021, Marc Orr wrote:
> > > > > If *it* is the host kernel, then you probably shouldn't do that -
> > > > > otherwise you just killed the host kernel on which all those guests are
> > > > > running.
> > > >
> > > > I agree, it seems better to terminate the single guest with an issue.
> > > > Rather than killing the host (and therefore all guests). So I'd
> > > > suggest even in this case we do the 'convert to shared' approach or
> > > > just outright terminate the guest.
> > > >
> > > > Are there already examples in KVM of a KVM bug in servicing a VM's
> > > > request results in a BUG/panic/oops? That seems not ideal ever.
> > >
> > > Plenty of examples. kvm_spurious_fault() is the obvious one. Any NULL pointer
> > > deref will lead to a BUG, etc... And it's not just KVM, e.g. it's possible, if
> > > unlikely, for the core kernel to run into guest private memory (e.g. if the kernel
> > > botches an RMP change), and if that happens there's no guarantee that the kernel
> > > can recover.
> > >
> > > I fully agree that ideally KVM would have a better sense of self-preservation,
> > > but IMO that's an orthogonal discussion.
> >
> > I don't think we should treat the possibility of crashing the host
> > with live VMs nonchalantly. It's a big deal. Doing so has big
> > implications on the probability that any cloud vendor will be able to
> > deploy this code to production. And aren't cloud vendors one of the
> > main use cases for all of this confidential compute stuff? I'm
> > honestly surprised that so many people are OK with crashing the host.
>
> I'm not treating it nonchalantly, merely acknowledging that (a) some flavors of kernel
> bugs (or hardware issues!) are inherently fatal to the system, and (b) crashing the
> host may be preferable to continuing on in certain cases, e.g. if continuing on has a
> high probability of corrupting guest data.

I disagree.
Crashing the host -- and _ALL_ of its VMs (including non-confidential VMs) -- is not preferable to crashing a single SNP VM. Especially when that SNP VM is guaranteed to detect the memory corruption and react accordingly.