On 26.06.21 01:31, Thiago Macieira wrote:
> Hi folks,
>
> The first problem is the cross-platformness need. Because we library
> and application developers need to support other OSes, we'll need to
> deploy our own CPUID-based detection. It's far better to use common
> code everywhere, where one developer working on Linux can fix bugs in
> FreeBSD, macOS or Windows or any of the permutations. Every
> platform-specific deviation adds to maintenance requirements and is a
> source of potential latent bugs, now or in the future due to
> refactoring. That is why doing everything in the form of instructions
> would be far better and easier, rather than system calls.
Hmm, maybe some libcpuid ?

Personally, I believe glibc is already too big - too many things in it
belong in a separate library.
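
To back up the libcpuid remark: for plain feature bits one doesn't even
need a library. Here's a minimal sketch (mine, not from Thiago's mail),
assuming an x86 target and a reasonably recent GCC or Clang, whose
<cpuid.h> header works the same on any OS; AVX2 is just an example
feature:

/* Minimal sketch: OS-independent x86 feature detection via the
 * <cpuid.h> header shipped with GCC and Clang. */
#include <cpuid.h>
#include <stdio.h>

static int cpu_has_avx2(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 7, sub-leaf 0: structured extended feature flags */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx & bit_AVX2) != 0;
}

int main(void)
{
    printf("AVX2: %s\n", cpu_has_avx2() ? "yes" : "no");
    return 0;
}
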
> The second problem is going to be backwards compatibility.
> Applications and libraries may want to ship precompiled binaries that
> make use of the new CPU features, whether they are open source or not.
Since we're talking about GNU libc here, binary-only stuff is probably
out of scope. OTOH, using different libc versions in those special
cases isn't such a big deal.

OTOH, depending on the programming language, those things might be
better put into the compiler. (Of course that needs good coordination
between different compiler vendors - see the sketch after the list
below.)
Why do I believe so ?

a) If you really want optimal performance, it's not just about whether
   you can use a certain opcode (and have a fallback if the opcode's
   missing) - the ordering of instructions also matters.

   For example, back in P4 times, an old friend of mine (a raytracing
   expert) did in-depth measurements of the P4's timing behaviour
   depending on opcode ordering, and he managed to achieve a 5..10x (!)
   performance improvement on raytracing workloads. But this was only
   possible with hand-written assembler code; a C compiler wouldn't be
   able to do that. (We actually designed a DSL for it, so a special
   compiler could generate the optimal CPU-specific code.)

b) In recent years we're moving more and more to higher-level (but
   still compiling to machine code) languages, where writing such
   machine-specific code is either very hard or practically impossible
   - there it's simply the compiler's duty to handle such things.
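
To make the "put it into the compiler" point concrete: GCC (and newer
Clang) already offer function multi-versioning, where the compiler
emits one clone per listed target plus a resolver, and the best clone
is picked for the running CPU at load time. A rough sketch, assuming an
x86 target with glibc; the function and the target list are purely
illustrative:

/* Compiler-side dispatch via function multi-versioning: one clone per
 * listed target, plus an automatically generated resolver that picks
 * the best one for the CPU the program actually runs on.  No
 * hand-written detection code anywhere. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

This only partially covers point a) - the compiler schedules and
vectorizes per target, but it won't match a hand-tuned assembler
kernel. It does cover the "don't make every programmer write CPUID
plumbing" part, though.
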
> It comes as no surprise to anyone that we CPU makers will have made
> software that uses those features and want to have it ready on Day 1
> of the HW being available for the market (if we're doing our jobs
> right). That often involves precompiling, because everyone who
> installed their compilers more than one year ago will not have the
> necessary tools to build.
Actually, you should talk to the compiler folks much earlier, at the
point where you know what those features will look like.
Of course, there are cases where folks can't upgrade their compilers
easily due to certain formal processes. But those are just special
cases; they've got tons of other problems for the same reason, and
quite frankly their processes are broken to begin with. I've done
several projects in regulated areas (medical, automotive, etc.) where
folks used ancient stuff with the excuse of too much paperwork and the
silly idea that a certain sheet of certification paper for some old
piece of code gives any clue about actual quality or helps with their
own qualification processes (which, when reading the regulations
carefully, quickly turns out to be nonsense).

For using certain new CPU-specific features, the need for a compiler
upgrade really should be no excuse. And at least for the vast majority
of cases, a proper compiler can do it much better than the average
programmer.
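
To make that concrete, a hedged sketch of the explicit variant with a
not-too-old GCC or Clang (the function names and the AVX2 choice are
mine, purely for illustration):

/* The hot loop is compiled twice, once with AVX2 enabled for just that
 * function, and the choice is made at run time with a compiler
 * builtin. */
#include <stddef.h>

__attribute__((target("avx2")))
static void scale_avx2(float *v, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)  /* may be auto-vectorized with AVX2 */
        v[i] *= s;
}

static void scale_generic(float *v, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= s;
}

void scale(float *v, float s, size_t n)
{
    if (__builtin_cpu_supports("avx2"))
        scale_avx2(v, s, n);
    else
        scale_generic(v, s, n);
}

The multi-versioning sketch above does the same selection
automatically; this variant just makes the dispatch explicit.
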
And by "recently", I mean "anything since the glibc that came with Red Hat
Enterprise Linux 7" (2.17).
Uh, that's really ancient. Nobody can seriously expect modern features
on such an ancient distro. If people really insist on spending so much
money to keep running such old code, instead of just doing a dist
upgrade, then I can only reply with "not our problem".

In that case it's Red Hat's duty to do their job right for once and
provide newer libc and gcc versions (and yes, it really isn't a big
deal to have multiple versions of them installed concurrently).

I'm really not a fan of running bleeding-edge software, but doing a
dist upgrade every 1..2 years really isn't asking too much (except
perhaps in some very exceptional cases). Especially not for big
organisations that can afford to pay such vendors so much. (Or are
their pockets empty because they already pay so much for RHEL ?)
> A platform-specific API to solve a problem that is already solved is
> "knock yourself out, we're not going to use this." So my first
> suggestion is to remove the "platform-specific" part and make this a
> cross-platform solution.
Speaking of platform-specific problems: that's where I have to say that
I'm *very* unhappy with you CPU vendor folks. Actually it's you who
created those - not just platform- but CPU-model-specific - problems.
The whole topic is anything but new; there could have been a long-term
solution decades ago, at the silicon level. (I could say a lot about
the IMHO really weird opcode layout, but that'd be too far out of
scope.)
What we SW engineers need is an easy and fast method to act depending
on whether some CPU supports some feature (e.g. a new opcode). Things
like cpuinfo are only a tiny piece of that. What we could really use is
a conditional jump/call based on whether feature X is supported -
without any kernel intervention. Then the machine code could easily be
laid out to support both cases, with or without feature X.
Alternatively we could have fast trapping in userland - a hw-generated
call already would be a big help.

If we had something in that direction, we wouldn't have to have this
kind of discussion here anymore - it would be entirely up to the
compiler and library folks, no need for any kernel support at all.
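
The nearest thing we have to that today lives in the toolchain rather
than the silicon: a GNU IFUNC, where the dynamic linker calls a
resolver once at relocation time and binds the symbol to whichever
implementation it returns, so every later call is an ordinary direct
call with no per-call feature test. A hedged sketch (needs glibc plus a
GNU toolchain; all names here are made up):

/* GNU IFUNC: the resolver below runs once, very early, from the
 * dynamic linker, and "memclear" is bound to whatever it returns. */
#include <cpuid.h>
#include <stddef.h>

__attribute__((target("avx2")))
static void memclear_avx2(void *p, size_t n)
{
    char *q = p;
    for (size_t i = 0; i < n; i++)   /* compiler may use AVX2 stores */
        q[i] = 0;
}

static void memclear_generic(void *p, size_t n)
{
    char *q = p;
    for (size_t i = 0; i < n; i++)
        q[i] = 0;
}

/* Runs before main(), from the dynamic linker - keep it
 * self-contained. */
static void (*resolve_memclear(void))(void *, size_t)
{
    unsigned int a, b, c, d;

    if (__get_cpuid_count(7, 0, &a, &b, &c, &d) && (b & bit_AVX2))
        return memclear_avx2;
    return memclear_generic;
}

void memclear(void *p, size_t n)
    __attribute__((ifunc("resolve_memclear")));

It's still a one-time software resolution, not the feature-conditional
branch in hardware you're asking for, but at least the kernel stays out
of the picture.
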
Going back to AMX - I just had a quick look at the spec (*1). Sorry,
but this thing is really weird and horrible to use. Come on, these
chips already have billions of transistors; it really can't hurt much
to spend a few more to provide a clean and easy-to-use machine code
interface. Grmmpf! (This is a general problem we've got with so many
HW folks: why can't they just talk to us SW folks first, so we can find
a good solution for both sides before this stuff goes into the field ?)
And one point that immediately jumps to mind (w/o looking deeper into
it): it introduces completely new registers - do we now need extra code
for task switching etc. ?
Since this stuff seems to be for tensor operations, I really wonder why
that has to be inline with classic operations ? Maybe I'm just lacking
imagination, but I don't see many use cases where the application would
not want to do those operations in larger batches - and for that we
already have auxiliary processors like TPUs or GPUs. What is the
intended use case for this ? Who would actually benefit from it ?
--mtx
*1)
https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
--
---
Note: unencrypted e-mails can easily be intercepted and manipulated !
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@xxxxxxxxx -- +49-151-27565287