Re: [PATCH v3 00/21] TDX host kernel support

Dan Williams <dan.j.williams@xxxxxxxxx> · Fri, 29 Apr 2022 10:48:06 -0700

On Fri, Apr 29, 2022 at 10:18 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 4/29/22 08:18, Dan Williams wrote:
> > Yes, I want to challenge the idea that all core-mm memory must be TDX
> > capable. Instead, this feels more like something that wants a
> > hugetlbfs / dax-device like capability to ask the kernel to gather /
> > set-aside the enumerated TDX memory out of all the general purpose
> > memory it knows about and then VMs use that ABI to get access to
> > convertible memory. Trying to ensure that all page allocator memory is
> > TDX capable feels too restrictive with all the different ways pfns can
> > get into the allocator.
>
> The KVM users are the problem here.  They use a variety of ABIs to get
> memory and then hand it to KVM.  KVM basically just consumes the
> physical addresses from the page tables.
>
> Also, there's no _practical_ problem here today.  I can't actually think
> of a case where any memory that ends up in the allocator on today's TDX
> systems is not TDX capable.
>
> Tomorrow's systems are going to be the problem.  They'll (presumably)
> have a mix of CXL devices that will have varying capabilities.  Some
> will surely lack the metadata storage for checksums and TD-owner bits.
> TDX use will be *safe* on those systems: if you take this code and run
> it on one of tomorrow's systems, it will notice the TDX-incompatible
> memory and will disable TDX.
>
> The only way around this that I can see is to introduce ABI today that
> anticipates the needs of the future systems.  We could require that all
> the KVM memory be "validated" before handing it to TDX.  Maybe a new
> syscall that says: "make sure this mapping works for TDX".  It could be
> new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

Yes, node-id seems the only reasonable handle that can be used, and it
does not seem too onerous for a KVM user to have to set a node policy
preferring all the TDX / confidential-computing capable nodes.
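
As a rough userspace illustration of what "set a node policy" could
look like: the sketch below builds a nodemask from a caller-supplied
list of node ids and applies it with the existing set_mempolicy(2)
syscall in MPOL_PREFERRED_MANY mode (Linux 5.15+). How the caller learns
which nodes are TDX capable is exactly the ABI that does not exist yet,
so the node list is an assumption, and the ex_* names and EX_MAX_NODES
limit are hypothetical helpers for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mode value as defined in the uapi header linux/mempolicy.h (5.15+). */
#define EX_MPOL_PREFERRED_MANY 5

/* Arbitrary upper bound on node ids for this sketch. */
#define EX_MAX_NODES 1024
#define EX_BITS_PER_LONG (8 * sizeof(unsigned long))
#define EX_MASK_LONGS (EX_MAX_NODES / EX_BITS_PER_LONG)

/*
 * Set one bit per node id in 'mask' (EX_MASK_LONGS longs wide).
 * Returns 0 on success, -1 if any id is out of range.
 */
int ex_build_nodemask(const int *nodes, size_t count, unsigned long *mask)
{
    assert(nodes != NULL && mask != NULL);
    memset(mask, 0, EX_MASK_LONGS * sizeof(unsigned long));
    for (size_t i = 0; i < count; i++) {
        if (nodes[i] < 0 || nodes[i] >= EX_MAX_NODES)
            return -1;
        mask[nodes[i] / EX_BITS_PER_LONG] |=
            1UL << (nodes[i] % EX_BITS_PER_LONG);
    }
    return 0;
}

/*
 * Ask the kernel to prefer the given nodes for this task's future
 * allocations. Returns 0 on success, -1 with errno set on failure.
 * The node ids are assumed to come from some not-yet-defined ABI that
 * enumerates TDX / confidential-computing capable nodes.
 */
int ex_prefer_nodes(const int *nodes, size_t count)
{
    unsigned long mask[EX_MASK_LONGS];

    if (ex_build_nodemask(nodes, count, mask) != 0) {
        errno = EINVAL;
        return -1;
    }
    /* The kernel treats maxnode as "number of bits + 1". */
    return (int)syscall(SYS_set_mempolicy, EX_MPOL_PREFERRED_MANY,
                        mask, (unsigned long)EX_MAX_NODES + 1);
}
```

A VMM would call something like ex_prefer_nodes() before faulting in
guest memory. A preference is only a hint, so a hard requirement would
more likely need MPOL_BIND, and the kernel side would still have to
validate what it is handed.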

> But, neither of those really help with, say, a device-DAX mapping of
> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> throw up its hands and leave users with the same result: TDX can't be
> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> device-DAX because they don't respect the NUMA policy ABI.

Device-DAX instances do have "target_node" attributes to associate
node-specific metadata, and device-DAX could certainly express
target_node capabilities in its own ABI. Then it's just a matter of
making pfn_to_nid() do the right thing so the KVM kernel side can
validate the capabilities of all inbound pfns.
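
To make the shape of that check concrete, here is a small userspace
model, not kernel code. It assumes a per-node capability flag and a
table of pfn ranges; ex_pfn_to_nid() is a stand-in for what the kernel's
pfn_to_nid() would need to answer, and every struct and function name
below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open pfn range [start_pfn, end_pfn) owned by node 'nid'. */
struct ex_pfn_range {
    uint64_t start_pfn;
    uint64_t end_pfn;
    int nid;
};

/* Hypothetical per-node capability record. */
struct ex_node_caps {
    bool tdx_capable;
};

/* Model of a pfn -> node lookup; returns -1 if no range covers the pfn. */
int ex_pfn_to_nid(const struct ex_pfn_range *ranges, size_t nranges,
                  uint64_t pfn)
{
    for (size_t i = 0; i < nranges; i++) {
        if (pfn >= ranges[i].start_pfn && pfn < ranges[i].end_pfn)
            return ranges[i].nid;
    }
    return -1;
}

/*
 * Return true only if every inbound pfn maps to a known node whose
 * capability record says it is TDX capable. Unknown pfns and unknown
 * nodes are rejected rather than assumed safe.
 */
bool ex_pfns_tdx_capable(const struct ex_pfn_range *ranges, size_t nranges,
                         const struct ex_node_caps *caps, size_t ncaps,
                         const uint64_t *pfns, size_t npfns)
{
    for (size_t i = 0; i < npfns; i++) {
        int nid = ex_pfn_to_nid(ranges, nranges, pfns[i]);

        if (nid < 0 || (size_t)nid >= ncaps)
            return false;
        if (!caps[nid].tdx_capable)
            return false;
    }
    return true;
}
```

The point is only that once the pfn -> node lookup is trustworthy for
every way memory can reach KVM, including device-DAX, the capability
check itself is trivial.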

> I'm open to ideas here.  If there's a viable ABI we can introduce to
> train TDX users today that will work tomorrow too, I'm all for it.

In general, expressing NUMA node performance and capabilities is
something Linux needs to get better at. HMAT data, for example, still
exists as sideband information ignored by numactl, but it feels
inevitable that perf and capability details will become first-class
citizens for applications that have these mem-allocation-policy
constraints in the presence of disparate memory types.