On Monday, 28 June 2021 10:43:47 PDT Peter Zijlstra wrote: > None of those I think. The worst case is old executables using > MINSIGSTKSZ and not using the magic signal context at all, just regular > old signals. If you run them on an AVX512 enabled machine, they overflow > their stack and cause memory corruption. Indeed, they are already broken today. Retroactively fixing them 5 years later can be an additional goal, but it shouldn't stop us from having an AMX solution. BTW, exactly MINSIGSTKSZ on an SKX overflows inside the kernel, so there's no memory corruption in userspace. The signal handler is not even invoked and the process is killed by a second signal delivery. But somewhere between MINSIGSTKSZ and SIGSTKSZ, it will invoke the signal handler and will overflow inside the user code. If that alt-stack wasn't designed with guard pages, it will corrupt memory, as you said. > AFAICT the only feasible way forward with that is some sysctl which > default disables AVX512 and add the prctl() and have some unsafe wrapper > that enables AVX512 for select 'legacy' programs for as long as they > exist :/ > > That is, binaries/libraries compiled against a static (and small) > MINSIGSTKSZ are the enemy. Which brings us to: > > 5) #4, in which each part is comes from a separate library with no > > knowledge of each other, and initialised concurrently in different > > threads > > That's terrible... library init should *NEVER* spawn threads (I know, > don't start). Indeed, but that wasn't even what I was suggesting. I was thinking that the application would have started threads and the library init was run on separate threads. This may happen with lazy initialisation on first use. But threads aren't required for the problem to happen. If the crash-handling case runs first, before AMX state is enabled, it might decide that it doesn't need to allocate sufficient alt-stack space for AMX. I think the recommendation here for userspace is clear: allocate the maximum that XSAVE tells you that you'll need, regardless of what the ambient enabled feature set is. [And pretty please set up guard pages. Given that the XSAVE state area for SPR appears to be too close to 12 kB (see below), I'd say they should mmap() 20 kB and then mprotect() the lowest page to PROT_NONE.] > Anything that does this is basically unfixable, because we can't > guarantee the AMX prctl() gets done before the first thread. > > So yes, worst case I suppose... To wit: the worst case is a static, small alt-stack *without* guard pages (e.g., a malloc()ed or even static buffer) of sufficient size to let the kernel transition back to userspace but not for the userspace routine to run. Any code using the old MINSIGSTKSZ (2048) will fail to run for AVX512 in the first place. There will be no data corruption, the crash handler will not run and the application will simply crash. An out-of-process core dump handler (if any, like systemd-coredump) will still get run. Code using SIGSTKSZ (8192) will run for AVX512 and there'll be about 5 kB left to the user routine. So if it doesn't have too deep a call stack, will not corrupt memory. And this code will not run for AMX state, because it's smaller than the XSAVE state (see below), making that the same case as the MINSIGSTKSZ for AVX512 above. It's possible someone would use something between those two values, but why? I expect that alt-stack handlers that use a constant value use either of the two constants or maybe some multiple of SIGSTKSZ, but nothing in-between them. $ /opt/intel/sde-external-8.63.0-2021-01-18-lin/sde64 -spr -- cpuid -1 -l 0xd CPU: XSAVE features (0xd/0): XCR0 lower 32 bits valid bit field mask = 0x000600ff XCR0 upper 32 bits valid bit field mask = 0x00000000 XCR0 supported: x87 state = true XCR0 supported: SSE state = true XCR0 supported: AVX state = true XCR0 supported: MPX BNDREGS = true XCR0 supported: MPX BNDCSR = true XCR0 supported: AVX-512 opmask = true XCR0 supported: AVX-512 ZMM_Hi256 = true XCR0 supported: AVX-512 Hi16_ZMM = true IA32_XSS supported: PT state = false XCR0 supported: PKRU state = false XCR0 supported: CET_U state = false XCR0 supported: CET_S state = false IA32_XSS supported: HDC state = false IA32_XSS supported: UINTR state = false LBR supported = false IA32_XSS supported: HWP state = false XTILECFG supported = true XTILEDATA supported = true bytes required by fields in XCR0 = 0x00002b00 (11008) bytes required by XSAVE/XRSTOR area = 0x00002b00 (11008) -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel DPG Cloud Engineering