Andy,

On Fri, Mar 26 2021 at 16:18, Andy Lutomirski wrote:
> arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states,
> sigclear_states, 0);
>
> Sets xcr0.  All states are preallocated except that states in
> lazy_states may be unallocated in the kernel until used.  (Not
> supported at all in v1.  lazy_states & ~xcr0 != 0 is illegal.)  States
> in sigsave_states are saved in the signal frame.  States in
> sigclear_states are reset to the init state on signal delivery.
> States in sigsave_states are restored by sigreturn, and states not in
> sigsave_states are left alone by sigreturn.

I like the idea in principle.

> Optionally we come up with a new format for new features in the signal
> frame, since the current format is showing its age.  Taking 8kB for a
> signal with AMX is one thing.  Taking another 8kB for a nested signal
> if AMX is not in use is worse.

I don't think that we should make that optional to begin with. Sizing
sigaltstack is a lottery as of today and making it more so does not
help at all.

> Optionally we make AVX-512 also default off, which fixes what is
> arguably a serious ABI break with AVX-512: lots of programs, following
> POSIX (!), seem to think that they know how much space to allocate
> for sigaltstack().  AVX-512 is too big.

I really wish we could do that. That AVX512 disaster is not trivial to
sort out.

Let's focus on AMX first. That ship at least has not sailed yet, but if
it does without a proper resolution then it's going to sail deep
south. Maybe we end up with some ideas about the AVX512 issue as well
that way.

The main problem I see is simply historical. Every other part of the
user space stack, from libraries to applications, tries to be "smart"
about utilizing the assumed best instruction set and feature
extensions, which are detected when something is initialized. I can
sing a song of that because I was casually involved in porting Debian
to an unsupported architecture. Magic all over the place.
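
To illustrate the kind of init-time detection meant here, a minimal
sketch of what such a library typically does: probe the CPU once, pick
the "best" implementation and cache it for the lifetime of the process.
All names (impl_*, pick_impl, lib_init) are hypothetical and not taken
from any real library. The probing assumes the GCC/clang builtin
__builtin_cpu_supports() on x86 and reports nothing elsewhere.

```c
#include <assert.h>
#include <stddef.h>

typedef const char *(*impl_fn)(void);

/* Stand-ins for differently optimized code paths (hypothetical). */
static const char *impl_generic(void) { return "generic"; }
static const char *impl_avx2(void)    { return "avx2"; }
static const char *impl_avx512(void)  { return "avx512"; }

/* Ask the compiler runtime what it detected; nothing on non-x86. */
static int cpu_has_avx2(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2") != 0;
#else
	return 0;
#endif
}

static int cpu_has_avx512f(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx512f") != 0;
#else
	return 0;
#endif
}

/* Pure selection policy: widest available extension wins. */
static impl_fn pick_impl(int has_avx2, int has_avx512f)
{
	if (has_avx512f)
		return impl_avx512;
	if (has_avx2)
		return impl_avx2;
	return impl_generic;
}

/* Decided once at first use and never revisited afterwards. */
static impl_fn cached_impl;

static impl_fn lib_init(void)
{
	if (cached_impl == NULL)
		cached_impl = pick_impl(cpu_has_avx2(), cpu_has_avx512f());
	assert(cached_impl != NULL);
	return cached_impl;
}
```

Each library makes this decision on its own, without knowing what any
other component in the same process decided or what the application
sized its signal stacks for.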

Now add the whole pile of proprietary software stacks and libraries on
top of that picture and things get completely out of control. Why?
Simply because user space has absolutely no concept of orchestrating
these things at all. That worked for a while, by some definition of
"works", and this model is still proliferated today even by players who
should know better. Even if you expected that some not so distant
events and the experience with fleet consistency would have stopped the
'performance first, features first' chorus in some way, that's not what
reality is.

Linux is not necessarily innocent. For years we just crammed features
into the kernel without thinking too hard about the big picture. But
yes, we realized the hard way that there is a problem, and just adding
yet another magic 'make it work' hack for AMX is definitely the wrong
approach.

What are the possible problems when we make it a hard requirement for
AMX to be requested by an application/task in order to use it?

For the kernel itself: not really any consequence I can think of, aside
from unhappy campers in user space.

For user space this is disruptive and we at least have to come up with
some reasonable model of how all involved components, with their
different ideas of how to best utilize a given CPU, can be handled.

That starts with the very simple problem of feature enumeration. Up to
now CPUID is unprivileged and a large amount of user space just takes
it as the ultimate reference. We can change that when CPUID faulting in
CPL3 is supported by the CPU, which we can't depend on because it is
not architectural.

Though the little devil in my head tells me that making AMX support
depend on the CPUID faulting capability might not be the worst thing.

Then we actually enforce CPUID faulting (finally) on CPUs which support
it, which would be a first step in the right direction, simply because
then random library X has to go to the kernel and ask for it explicitly
or just shrug and use whatever the kernel is willing to hand out in
CPUID.

Now take that one step further. When the first part of some user space
application asks for it, then you can register that with the process
and make sane decisions for all other requesters which come after it,
which is an important step in the direction of having a common
orchestration for this.

Sure, you can do that via XCR0 as well to some extent, but CPUID
faulting would solve a whole class of other problems which people who
care about feature consistency face today at least to some extent.

And contrary to XCR0, which is orthogonal and obviously still required
for the AMX (and, hint, AVX512) problem, CPUID faulting would just hand
out the feature bits which the kernel wants to hand out.

If the app, library or whatever still tries to use them, then they get
the #UD, #GP or whatever penalty is associated with that particular
XCR0-disabled piece. It's not there, you tried, keep the pieces.

Making it depend solely on XCR0 and faulting if not requested upfront
brings you into the situation that you broke 'legacy code' which relied
on the CPUID bit and worked until now, which gets you into the
no-regression trap.

I haven't thought this through obviously, but depending solely on XCR0
faults did not really add up, so I thought I'd share that evil idea for
broader discussion.

Thanks,

        tglx
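
For illustration, the difference between gating on XCR0 alone and
combining it with CPUID faulting can be modeled roughly as below. This
is a toy user-space model, not kernel code: the feature bits, struct
proc_model and all function names are hypothetical, and the policy in
request_features() (grant anything the hardware has) is only an
assumption standing in for whatever "sane decisions" would end up
meaning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; NOT the real CPUID or XCR0 layout. */
#define FEAT_AVX512	(1u << 0)
#define FEAT_AMX	(1u << 1)

struct proc_model {
	uint32_t hw;		/* what the CPU implements */
	uint32_t enabled;	/* enabled for this process (think XCR0) */
	bool cpuid_faulting;	/* kernel intercepts CPUID in CPL3 */
};

/* What a CPUID query issued from user space would report. */
static uint32_t cpuid_view(const struct proc_model *p)
{
	assert(p != NULL);
	/* With faulting the kernel hands out only what it wants to. */
	return p->cpuid_faulting ? (p->hw & p->enabled) : p->hw;
}

/* Explicit request, recorded process-wide for all later requesters. */
static int request_features(struct proc_model *p, uint32_t want)
{
	assert(p != NULL);
	if (want & ~p->hw)
		return -1;	/* not there at all */
	p->enabled |= want;
	return 0;
}

/* Legacy code trusts CPUID alone; true if using 'feat' would fault. */
static bool legacy_use_faults(const struct proc_model *p, uint32_t feat)
{
	if (!(cpuid_view(p) & feat))
		return false;	/* bit not advertised: other code path */
	return !(p->enabled & feat);	/* advertised but disabled */
}
```

In this model, a process with hw = FEAT_AVX512 | FEAT_AMX and
enabled = FEAT_AVX512 makes legacy_use_faults() return true for
FEAT_AMX when cpuid_faulting is off, which is the no-regression trap.
With cpuid_faulting on, the bit is never advertised until someone calls
request_features(), so code which only looks at CPUID keeps working.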