Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Mon, 17 May 2021 11:45:00 +0200

Len,

On Sun, May 02 2021 at 11:27, Len Brown wrote:
> Here is how it works:
>
> 1. The kernel boots and sees the feature in CPUID.
>
> 2. If the kernel supports that feature, it sets XCR0[feature].
>
>     For some features, there may be a bunch of kernel support,
>     while simple features may require only state save/restore.
>
> 2a.  If the kernel doesn't support the feature, XCR0[feature] remains cleared.
>
> 3. user-space sees the feature in CPUID
>
> 4. user-space sees for the feature via xgetbv[XCR0]
>
> 5. If the feature is enabled in XCR0, the user happily uses it.
>
>     For AMX, Linux implements "transparent first use"
>     so that it doesn't have to allocate 8KB context switch
>     buffers for tasks that don't actually use AMX.
>     It does this by arming XFD for all tasks, and taking a #NM
>     to allocate a context switch buffer only for those tasks
>     that actually execute AMX instructions.

I thought more about this and it's absolutely the wrong way to go for
several reasons.

AMX (or whatever comes next) is nothing else than a device and it
just should be treated as such. The fact that it is not exposed
via a driver and a device node does not matter at all.

Not doing so requires this awkward buffer allocation issue via #NM with
all it's downsides; it's just wrong to force the kernel to manage
resources of a user space task without being able to return a proper
error code. 

It also prevents fine grained control over access to this
functionality. As AMX is clearly a shared resource which is not per HT
thread (maybe not even per core) and it has impact on power/frequency it
is important to be able to restrict access on a per process/cgroup
scope.

Having a proper interface (syscall, prctl) which user space can use to
ask for permission and allocation of the necessary buffer(s) is clearly
avoiding the downsides and provides the necessary mechanisms for proper
control and failure handling.

It's not the end of the world if something which wants to utilize this
has do issue a syscall during detection. It does not matter whether
that's a library or just the application code itself.

That's a one off operation and every involved entity can cache the
result in TLS.

AVX512 has already proven that XSTATE management is fragile and error
prone, so we really have to stop this instead of creating yet another
half baken solution.

Thanks,

        tglx