On 5/9/22 15:17, Borislav Petkov wrote:
>> This new ABI provides a way to avoid that situation in the first place.
>> Userspace can look at sysfs to figure out which NUMA nodes support
>> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
>> avoid TDH.MEM.PAGE.ADD failures.
>>
>> So, here's the question for the TDX folks: are these mixed-capability
>> systems a problem for you? Does this ABI help you fix the problem?
> What I'm not really sure too is, is per-node granularity ok? I guess it
> is but let me ask it anyway...

I think nodes are the only sane granularity.

tl;dr: Zones might work in theory but have no existing useful ABI around
them and too many practical problems. Nodes are the only other real
option without inventing something new and fancy.

--

What about zones (or any sub-node granularity, really)?

Folks have, for instance, discussed adding new memory zones for this
purpose: have ZONE_NORMAL, and then ZONE_UNENCRYPTABLE (or something
similar). Zones are great because they have their own memory allocation
pools and can be targeted directly from within the kernel using things
like GFP_DMA. If you run out of ZONE_FOO, you can theoretically just
reclaim ZONE_FOO.

But even a single new zone isn't necessarily good enough. What if we
have some ZONE_NORMAL that's encryption-capable and some that's not?
The same goes for ZONE_MOVABLE. We'd probably need at least:

	ZONE_NORMAL
	ZONE_NORMAL_UNENCRYPTABLE
	ZONE_MOVABLE
	ZONE_MOVABLE_UNENCRYPTABLE

Also, zones are (mostly) not exposed to userspace. If we want userspace
to be able to specify encryption capabilities, we're talking about new
ABI for enumeration and policy specification.

Why node granularity?

First, for the majority of cases, nodes "just work". ACPI systems with
an HMAT table already separate out different performance classes of
memory into different Proximity Domains (PXMs), which the kernel maps
into NUMA nodes.
This means that NVDIMMs, or virtually any CXL memory region (one or more
CXL devices glued together) we can think of, already get their own NUMA
node. Those nodes have their own zones (implicitly) and can lean on the
existing NUMA ABI for enumeration and policy creation.

Basically, the firmware creates the NUMA nodes for the kernel. All the
kernel has to do is acknowledge which of them can do encryption or not.

The one place where nodes fall down is if a memory hot-add occurs within
an existing node and the newly hot-added memory does not match the
encryption capabilities of the existing memory. The kernel basically has
two options in that case:

 * Throw away the memory until the next reboot, where the system might
   be reconfigured in a way that supports more uniform capabilities
   (this is actually *likely* for a reboot of a TDX system)
 * Create a synthetic NUMA node to hold it

Neither one of those is a horrible option. Throwing the memory away is
the most likely way TDX will handle this situation if it pops up. For
now, the folks building TDX-capable BIOSes claim emphatically that such
a system won't be built.