On Thu, Jan 12, 2023 at 1:33 PM Chang S. Bae <chang.seok.bae@xxxxxxxxx> wrote:
>
> On 1/12/2023 1:21 PM, Mingwei Zhang wrote:
> >
> > The only comment I would have is that it seems not following the least
> > privilege principle as host process (QEMU) may not have the motivation
> > to do any matrix multiplication. But this is a minor one.
> >
> > Since this enabling once per-process, I am wondering when after
> > invocation of arch_prctl(2), all of the host threads will have a larger
> > fp_state? If so, that might be a sizeable overhead since host userspace
> > may have lots of threads doing various of other things, i.e., they may
> > not be vCPU threads.
>
> No, the permission request does not immediately result in the kernel's
> XSAVE buffer expansion, but only when the state is about used. As
> XFD-armed, the state use will raise #NM. Then, it will reallocate the
> task's fpstate via this call chain:
>
> #NM --> handle_xfd_event() --> xfd_enable_feature() --> fpstate_realloc()
>
> Thanks,
> Chang

Thanks for the info, but I think you are describing host-level AMX
enabling, which I already understand. My question is how AMX gets enabled
by QEMU and used by vCPU threads in the guest.

After digging a little, I think I understand it now. In fact, the guest
fp_state is not allocated lazily; it is allocated up front, at
KVM_SET_CPUID2 time, via this call chain:

kvm_set_cpuid() / kvm_set_cpuid2()
  -> kvm_check_cpuid()
    -> fpu_enable_guest_xfd_features()
      -> __xfd_enable_feature()
        -> fpstate_realloc()

Note that KVM does intercept #NM for the guest, but only to handle
XFD_ERR.

Before calling kvm_set_cpuid() or kvm_set_cpuid2(), the QEMU thread must
request permission via arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM) in order to
become a vCPU thread; otherwise, the call sequence above fails.
Fortunately, the guest permission request is needed only once per process
(i.e., once per VM).

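For illustration, a minimal sketch of the per-process permission request a VMM performs before KVM_SET_CPUID2, assuming an x86-64 Linux host with kernel 5.17 or newer. The arch_prctl() codes and the XTILEDATA feature number are copied from the kernel's uapi headers; the helper names request_guest_amx_perm() and get_guest_xcomp_perm() are hypothetical and are not QEMU's actual code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* arch_prctl() codes, values as in the kernel's uapi <asm/prctl.h>.
 * Defined locally so the sketch does not depend on installed headers. */
#ifndef ARCH_GET_XCOMP_GUEST_PERM
#define ARCH_GET_XCOMP_GUEST_PERM 0x1024
#endif
#ifndef ARCH_REQ_XCOMP_GUEST_PERM
#define ARCH_REQ_XCOMP_GUEST_PERM 0x1025
#endif

/* XSAVE feature number of AMX tile data (XTILEDATA). */
#define XFEATURE_XTILEDATA 18

/* Hypothetical helper: ask the kernel to permit guest use of AMX tile
 * data for this process. Returns 0 on success or -errno on failure.
 * The permission is per-process, so one call covers every thread that
 * later becomes a vCPU thread. */
static int request_guest_amx_perm(void)
{
#ifdef SYS_arch_prctl
    if (syscall(SYS_arch_prctl, (long)ARCH_REQ_XCOMP_GUEST_PERM,
                (unsigned long)XFEATURE_XTILEDATA) == 0)
        return 0;
    return -errno;
#else
    return -ENOSYS; /* arch_prctl() is not available on this target */
#endif
}

/* Hypothetical helper: read back the bitmap of XSAVE features the
 * process is permitted to expose to guests. */
static int get_guest_xcomp_perm(uint64_t *mask)
{
#ifdef SYS_arch_prctl
    if (syscall(SYS_arch_prctl, (long)ARCH_GET_XCOMP_GUEST_PERM, mask) == 0)
        return 0;
    return -errno;
#else
    (void)mask;
    return -ENOSYS;
#endif
}
```

In this sketch the request is made once, before any vCPU thread issues KVM_SET_CPUID2; on a host without AMX the request simply fails with a negative errno, and no thread's fp_state changes size as a result of the call itself.
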
Because the reallocation happens only on this path, non-vCPU threads do
not get a larger fp_state unless and until they invoke kvm_set_cpuid() or
kvm_set_cpuid2(). I think that closes the loop for me.

Thanks.
-Mingwei