Re: Linux guest kernel threat model for Confidential Computing

Leon Romanovsky <leon@xxxxxxxxxx> · Thu, 26 Jan 2023 20:06:26 +0200

On Thu, Jan 26, 2023 at 05:48:33PM +0000, Reshetova, Elena wrote:
> 
> > * Reshetova, Elena (elena.reshetova@xxxxxxxxx) wrote:
> > > > On Wed, Jan 25, 2023 at 03:29:07PM +0000, Reshetova, Elena wrote:
> > > > > Replying only to the not-so-far addressed points.
> > > > >
> > > > > > On Wed, Jan 25, 2023 at 12:28:13PM +0000, Reshetova, Elena wrote:
> > > > > > > Hi Greg,
> > > >
> > > > <...>
> > > >
> > > > > > > 3) All the tools are open-source and everyone can start using them
> > > > > > > right away, even without any special HW (the readme describes what
> > > > > > > is needed).
> > > > > > > Tools and documentation are here:
> > > > > > > https://github.com/intel/ccc-linux-guest-hardening
> > > > > >
> > > > > > Again, as our documentation states, when you submit patches based on
> > > > > > these tools, you HAVE TO document that.  Otherwise we think you all are
> > > > > > crazy and will get your patches rejected.  You all know this, why ignore
> > > > > > it?
> > > > >
> > > > > Sorry, I didn’t know that for every bug found in the Linux kernel, when
> > > > > we submit a fix, we have to state how it was found.
> > > > > We will fix this in future submissions, but some of the bugs we have were
> > > > > found by plain code audit, so 'human' is the tool.
> > > >
> > > > My problem with that statement is that by applying a different threat
> > > > model you "invent" bugs which didn't exist in the first place.
> > > >
> > > > For example, in this latest submission [1], the authors labeled correct
> > > > behaviour as a "bug".
> > > >
> > > > [1] https://lore.kernel.org/all/20230119170633.40944-1-alexander.shishkin@xxxxxxxxxxxxxxx/
> > >
> > > Hm.. Does everyone think that when the kernel dies with an unhandled page
> > > fault (as in that case) or on detection of a KASAN out-of-bounds violation
> > > (as in some other cases that we already have fixes for or are
> > > investigating), it represents correct behavior, even if you expect all
> > > your PCI HW devices to be trusted? What about an error in two consecutive
> > > PCI reads? What about just some failure that results in erroneous input?
> > 
> > I'm not sure you'll get general agreement on those answers for all
> > devices and situations; I think for most devices in non-CoCo
> > situations, people are generally OK with a misbehaving PCI device
> > causing a kernel crash. Since most people are running without an IOMMU
> > anyway, a misbehaving device can cause otherwise undetectable chaos.
> 
> Ok, if this is the consensus within the kernel community, then we can consider
> the fixes strictly from the CoCo threat model point of view.
> 
> > 
> > I'd say:
> >   a) For CoCo, a guest (guaranteed) crash isn't a problem - CoCo doesn't
> >   guarantee forward progress or stop the hypervisor doing something
> >   truly stupid.
> 
> Yes, denial of service is out of scope, but I would not automatically lump
> all crashes together as 'safe'. Depending on the crash, it can be used as a
> primitive to launch further attacks: privilege escalation, information
> disclosure and corruption. This is especially true for memory corruption
> issues.
> 
> >   b) For CoCo, information disclosure, or corruption IS a problem
> 
> Agreed, but the path to this can incorporate a number of attack
> primitives, as well as bug chaining. So if a bug is detected and the
> fix is easy, it is safer to fix it than to reason about its possible
> implications and potential use in exploit writing.
> 
> > 
> >   c) For non-CoCo, some people might care about robustness of the kernel
> >   against a failing PCI device, but generally I think they worry about
> >   a fairly clean failure, even in the unexpected hot-unplug case.
> 
> Ok.

Wearing my other hat as a representative of a hardware vendor (at least for
the NIC part) that cares about the quality of its devices: we don't want to
hide ANY crash related to our devices, especially if it is caused by
misbehaving PCI HW logic. Any uncontrolled "robustness" hides real issues and
makes QA/customer support much harder.

Thanks