Re: x86 CPU features detection for applications (and AMX)

"Enrico Weigelt, metux IT consult" <lkml@xxxxxxxxx> · Wed, 30 Jun 2021 14:50:30 +0200

On 28.06.21 15:20, Peter Zijlstra wrote:

And one point that immediately comes to mind (w/o looking deeper
into it): it introduces completely new registers - do we now need extra
code for task switching etc.?

No, but because it's register state and part of XSAVE, it has an immediate
impact on the ABI. In particular, the signal stack layout includes XSAVE (as
does ptrace()).

OMG, I already suspected something this sick. I don't even dare think
about the consequences for compilers and library ABIs.

Does anyone here know why they designed this as inline operations? This
thing seems to be pretty much what typical TPUs do (or a subset
of it). Why not just add a TPU next to the CPU on the same chip?

We already have the same with GPUs, and I guess nobody seriously wants to
put GPU functionality directly into the CPU.

At the same time, 'legacy' applications (up until _very_ recently) had a
minimum signal stack size of 2K, which is already violated by the
addition of AVX512 (there's actual breakage due to that).

grmpf!

Adding the insane AMX state (8k+) into that is a complete trainwreck
waiting to happen. Not to mention that having !INIT AMX state has direct
consequences for P-state selection and thus performance.

Uh, are those new registers retained in certain sleep states, or do they
need to be saved somewhere?
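
To put the "8k+" figure in context: the size of each XSAVE state component
can be read from CPUID leaf 0xD. Below is a minimal sketch, assuming GCC or
Clang's <cpuid.h>. The helper names are made up for illustration, and
component number 18 for the AMX tile data is an assumption taken from
Intel's documentation, not from this thread.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Assumption: XSAVE component number of the AMX tile data (per Intel SDM). */
#define XSAVE_COMPONENT_XTILEDATA 18

/*
 * Hypothetical helper: size in bytes of one extended XSAVE state component,
 * as reported in CPUID.(EAX=0DH, ECX=component):EAX.
 * Returns 0 if the leaf or the component is not available.
 */
static uint32_t xsave_component_size(unsigned int component)
{
    /* Sub-leaves 0 and 1 have a different layout; components start at 2. */
    assert(component >= 2);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x0d, component, &eax, &ebx, &ecx, &edx))
        return 0;
    return eax;
#else
    return 0; /* not an x86 build */
#endif
}

/*
 * Hypothetical helper: size of the XSAVE area needed if every state
 * component the CPU supports were enabled, CPUID.(EAX=0DH, ECX=0):ECX.
 */
static uint32_t xsave_max_size(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x0d, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx;
#else
    return 0; /* not an x86 build */
#endif
}
```

On a CPU without AMX, xsave_component_size(XSAVE_COMPONENT_XTILEDATA)
should come back as 0. Comparing xsave_max_size() against the traditional
2K signal stack minimum shows how little room is left once the wider
state is saved into a signal frame.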

For these reasons, we OS folks will mandate that you do a prctl() to
request/release AMX (and we get to say: no). If you use AMX without
this, the instruction will fault (because it is not set in XCR0) and we'll
SIGBUS or something.

Userspace will have to do something like:

  - check CPUID, if !AMX -> fail
  - issue prctl(), if error -> fail
  - issue XGETBV and check the AMX bit is set, if not -> fail

Can't we do this with just the prctl() call?
IOW: ask the kernel, which is going to say yes or no.

Are there any situations where the kernel says yes, but the process still
can't use it? Why so?
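
For reference, the three checks listed so far (CPUID, prctl(), XGETBV)
could be sketched roughly as follows. Big caveat: the exact prctl()
interface is not specified in this thread. The sketch assumes the
arch_prctl(ARCH_REQ_XCOMP_PERM, ...) call and constants that, as far as I
know, were merged into Linux later. The CPUID bit (leaf 7, EDX bit 24 for
AMX-TILE) and the XCR0 bits 17/18 are taken from Intel's documentation.
All function and enum names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#if defined(__x86_64__)
#include <cpuid.h>
#endif

/* Assumptions: values of the arch_prctl() interface merged after this thread. */
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILECFG   17
#define XFEATURE_XTILEDATA  18
#define XTILE_MASK ((1ULL << XFEATURE_XTILECFG) | (1ULL << XFEATURE_XTILEDATA))

enum amx_status {
    AMX_OK,
    AMX_NO_CPU_SUPPORT,   /* step 1 failed: CPUID says no AMX           */
    AMX_KERNEL_REFUSED,   /* step 2 failed: kernel said no              */
    AMX_NOT_IN_XCR0       /* step 3 failed: XCR0 lacks the AMX bits     */
};

static const char *amx_status_name(enum amx_status s)
{
    switch (s) {
    case AMX_OK:             return "ok";
    case AMX_NO_CPU_SUPPORT: return "no CPU support";
    case AMX_KERNEL_REFUSED: return "kernel refused";
    case AMX_NOT_IN_XCR0:    return "not enabled in XCR0";
    }
    assert(!"unknown amx_status");
    return "unknown";
}

#if defined(__x86_64__) && defined(__linux__)
/* Read XCR0 with XGETBV; only legal once CPUID reports OSXSAVE. */
static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}
#endif

static enum amx_status amx_request(void)
{
#if defined(__x86_64__) && defined(__linux__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Step 1: CPUID.(EAX=7, ECX=0):EDX[24] is AMX-TILE. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 24)))
        return AMX_NO_CPU_SUPPORT;

    /* XGETBV would fault without OSXSAVE, CPUID.1:ECX[27]. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return AMX_NOT_IN_XCR0;

    /* Step 2: ask the kernel for permission; it may refuse. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return AMX_KERNEL_REFUSED;

    /* Step 3: confirm both tile state bits are set in XCR0. */
    if ((read_xcr0() & XTILE_MASK) != XTILE_MASK)
        return AMX_NOT_IN_XCR0;

    return AMX_OK;
#else
    return AMX_NO_CPU_SUPPORT;  /* not an x86-64 Linux build */
#endif
}
```

A caller would run amx_request() once, early, before any threads are
created, and log amx_status_name() of the result if it is not AMX_OK.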

  - request the signal stack size / spawn threads

The signal stack is separate from the usual stack, right?
Why can't this all be done in one shot?

  - use AMX

Spawning threads prior to enabling AMX will result in using the wrong
signal stack size and result in malfunction, you get to keep the pieces.

Is there no way of adjusting this once the threads are running?
Or could we even do that on a per-thread basis?

A thread here always has a corresponding kernel task, correct?
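
As background for the signal stack step: with the existing interfaces, an
alternate signal stack is installed per thread via sigaltstack(), so each
thread has to size its own. A minimal sketch, assuming Linux with glibc,
and assuming the kernel exports the required minimum through
getauxval(AT_MINSIGSTKSZ) (it returns 0 where that entry is missing, in
which case the compile-time MINSIGSTKSZ is used). The helper names and the
headroom parameter are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51   /* assumption: value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: minimum signal stack size for this system, taking
 * the larger of the kernel-reported value and the libc constant.
 */
static size_t min_sigstack_size(void)
{
    size_t from_kernel = (size_t)getauxval(AT_MINSIGSTKSZ); /* 0 if absent */
    size_t from_libc = (size_t)MINSIGSTKSZ;
    return from_kernel > from_libc ? from_kernel : from_libc;
}

/*
 * Hypothetical helper: allocate and install an alternate signal stack for
 * the CALLING thread only. 'headroom' is extra space for the handler's own
 * frames. Returns the stack base (to free later) or NULL on failure.
 */
static void *install_sigaltstack(size_t headroom, size_t *size_out)
{
    stack_t ss;
    size_t size = min_sigstack_size() + headroom;
    void *base = malloc(size);

    if (base == NULL)
        return NULL;

    ss.ss_sp = base;
    ss.ss_size = size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(base);
        return NULL;
    }
    if (size_out != NULL)
        *size_out = size;
    return base;
}
```

Each thread would call install_sigaltstack() at the start of its thread
function, and handlers need SA_ONSTACK set in sigaction() to actually run
on that stack. Whether the reported minimum changes after AMX has been
enabled is exactly the ordering question raised above; this sketch does
not answer it.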

--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your
GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@xxxxxxxxx -- +49-151-27565287