Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

Dave Hansen <dave.hansen@xxxxxxxxx> · Tue, 20 Jun 2023 09:21:38 -0700

On 6/20/23 09:03, David Hildenbrand wrote:
> On 20.06.23 17:39, Dave Hansen wrote:
>> On 6/19/23 05:21, David Hildenbrand wrote:
>>> So, ordinary writes to TD private memory are not a problem? I thought
>>> one motivation for the unmapped-guest-memory discussion was to prevent
>>> host (userspace) writes to such memory because it would trigger a MC and
>>> eventually crash the host.
>>
>> Those are two different problems.
>>
>> Problem #1 (this patch): The host encounters poison when going about its
>> normal business accessing normal memory.  This happens when something in
>> the host accidentally clobbers some TDX memory and *then* reads it.
>> Only occurs with partial writes.
>>
>> Problem #2 (addressed with unmapping): Host *userspace* intentionally
>> and maliciously clobbers some TDX memory and then the TDX module or a
>> TDX guest can't run because the memory integrity checks (checksum or TD
>> bit) fail.  This can also take the system down because #MC's are nasty.
>>
>> Host userspace unmapping doesn't prevent problem #1 because it's the
>> kernel who screwed up with the _kernel_ mapping.
> 
> Ahh, thanks for verifying. I was hoping that problem #2 would get fixed
> in HW as well (and treated like a BUG).

No, it's really working as designed.

#1 _can_ be fixed because the hardware can just choose to let the host
run merrily along, corrupting TDX data while blissfully unaware of the
carnage, until TDX stumbles on the mess.  Blissful ignorance really is a
useful feature here.  It means, for instance, that if the kernel screws
up, it can still blissfully kexec(), reboot, boot a new kernel, or dump
to the console without fear of #MC.

#2 is much harder because the TDX data is destroyed and yet the TDX side
still wants to run.  The SEV folks chose page faults on write to stop
SEV from running and the TDX folks chose #MC on reads as the mechanism.

All of the nastiness on the TDX side is (IMNHO) really a consequence of
that decision to use machine checks.

(Aside: I'm not specifically crapping on the TDX CPU designers here.  I
 don't particularly like the SEV approach either.  But this mess is a
 result of the TDX design choices.  There are other messes in other
 patch series from SEV.)

> Because problem #2 also sounds like something that directly violates the
> first paragraph of this patch description "violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself."
> 
> So I would expect the TDX guest to fail hard, but not other TDX guests
> (or the host kernel).

This is more fallout from the #MC design choice.

Let's use page faults as an example since our SEV friends are using
them.  *ANY* instruction that reads memory can page fault, have the
kernel fix up the fault, and continue merrily along its way.

#MC is fundamentally different.  The exceptions can be declared to be
unrecoverable.  The CPU says, "whoopsie, I managed to deliver this #MC,
but it would be too hard for me so I can't continue."  These "too hard"
scenarios are shrinking over time, but they do exist.  They're fatal.
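
To make the "too hard" case a bit more concrete, here is a rough sketch
of the kind of check a handler has to make.  It is not taken from this
series or from the kernel's real severity logic, which looks at far more
state.  The bit positions are the architectural ones from the SDM;
mc_is_fatal() is a made-up name for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits (architectural). */
#define MCG_STATUS_RIPV (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1)  /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress */

/* IA32_MCi_STATUS bits (architectural). */
#define MCI_STATUS_VAL  (1ULL << 63) /* bank holds a valid error */
#define MCI_STATUS_UC   (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_PCC  (1ULL << 57) /* processor context corrupt */

/*
 * Hypothetical helper, simplified on purpose: treat the #MC as fatal
 * when the CPU says it cannot restart the interrupted instruction
 * (RIPV clear) or when a valid bank reports corrupt processor context.
 */
static bool mc_is_fatal(uint64_t mcg_status, uint64_t mci_status)
{
	if (!(mcg_status & MCG_STATUS_RIPV))
		return true;

	if ((mci_status & MCI_STATUS_VAL) && (mci_status & MCI_STATUS_PCC))
		return true;

	return false;
}
```

Contrast that with a page fault handler, which never has to ask the CPU
whether resuming the faulting instruction is even possible.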