"Huang, Kai" <kai.huang@xxxxxxxxx> writes: > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote: >> On 1/10/23 04:09, Huang, Kai wrote: >> > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote: >> > > On 1/9/23 03:48, Huang, Kai wrote: >> > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX >> > > > > > > > memory to a separate NUMA node. In this case, the "TDX-capable" nodes >> > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace >> > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from >> > > > > > > > the "TDX-capable" nodes. >> > > > > > >> > > > > > Why does it need to be enhanced? What's the problem? >> > > > >> > > > The problem is after TDX module initialization, no more memory can be hot-added >> > > > to the page allocator. >> > > > >> > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can >> > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can >> > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can >> > > > utilize all memory. >> > > > >> > > > But probably it is not necessarily to call out in the changelog? >> > > >> > > Let's say that we add this TDX-compatible-node ABI in the future. What >> > > will old code do that doesn't know about this ABI? >> > >> > Right. The old app will break w/o knowing the new ABI. One resolution, I >> > think, is we don't introduce new userspace ABI, but hide "TDX-capable" and "non- >> > TDX-capable" nodes in the kernel, and let kernel to enforce always allocating >> > TDX guest memory from those "TDX-capable" nodes. >> >> That doesn't actually hide all of the behavior from users. Let's say >> they do: >> >> numactl --membind=6 qemu-kvm ... >> >> In other words, take all of this guest's memory and put it on node 6. >> There lots of free memory on node 6 which is TDX-*IN*compatible. 
Then, >> they make it a TDX guest: >> >> numactl --membind=6 qemu-kvm -tdx ... >> >> What happens? Does the kernel silently ignore the --membind=6? Or does >> it return -ENOMEM somewhere and confuse the user who has *LOTS* of free >> memory on node 6. >> >> In other words, I don't think the kernel can just enforce this >> internally and hide it from userspace. > > IIUC, the kernel, for instance KVM who has knowledge the 'task_struct' is a TDX > guest, can manually AND "TDX-capable" node masks to task's mempolicy, so that > the memory will always be allocated from those "TDX-capable" nodes. KVM can > refuse to create the TDX guest if it found task's mempolicy doesn't have any > "TDX-capable" node, and print out a clear message to the userspace. > > But I am new to the core-mm, so I might have some misunderstanding. KVM here means in-kernel KVM module? If so, KVM can only output some message in dmesg. Which isn't very good for users to digest. It's better for the user space QEMU to detect whether current configuration is usable and respond to users, via GUI, or syslog, etc. Best Regards, Huang, Ying
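
To make the check under discussion concrete, here is a minimal userspace-side sketch of it. It is only an illustration: no ABI exposing "TDX-capable" nodes exists in this thread, so the node sets are passed in as plain Python sets, and the function names `effective_nodes` and `check_tdx_placement` are hypothetical. The logic is just the AND of the bound nodes with the TDX-capable nodes, with a readable error when the result is empty.

```python
from typing import Optional, Set


def effective_nodes(membind: Set[int], tdx_capable: Set[int]) -> Set[int]:
    """AND the task's bound nodes with the TDX-capable nodes.

    Both sets are assumptions supplied by the caller; how userspace would
    learn which nodes are TDX-capable is exactly the open ABI question.
    """
    return membind & tdx_capable


def check_tdx_placement(membind: Set[int], tdx_capable: Set[int]) -> Optional[str]:
    """Return None if a TDX guest can be placed, else a user-facing message."""
    if effective_nodes(membind, tdx_capable):
        return None
    return (
        "cannot start TDX guest: none of the bound nodes "
        f"{sorted(membind)} is TDX-capable "
        f"(TDX-capable nodes: {sorted(tdx_capable)})"
    )


if __name__ == "__main__":
    # Dave's example: --membind=6, where node 6 is TDX-incompatible.
    error = check_tdx_placement({6}, {0, 1})
    print(error if error else "configuration is usable")
```

With a check like this done in userspace, the failure in the `--membind=6` case can be reported directly to the user instead of surfacing as an -ENOMEM or a line in dmesg.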