Having a proper interface (syscall, prctl) which user space can use to ask for permission and allocation of the necessary buffer(s) clearly avoids the downsides and provides the necessary mechanisms for proper control and failure handling.
This would need to be a "get / put" interface, so a refcount; that way things nest nicely. For API symmetry I'd want to have the put there, even if we may decide to be infinitely lazy in cleaning up the state.

I would also want it to take an argument that's a bitmask, so that this can be applied to future state as well. Actually, I'd start by also adding AVX512 to this. Even though for obvious compat reasons that one is on by default (so at process start we might need to start with a count of 1), it's interesting to fold that into this same framework. (And who knows, dropping AVX512 state if you don't need it might improve context switches.)

Syscalls are relatively cheap (and I can imagine the C library doing a TLS cache of the count if it becomes an issue), so this can be done at a relatively fine-grained level.

I've worked on OpenBLAS before, and that library basically has a global initialization function that ends up getting called on the first big math op (it may spawn threads as well, etc.) but which "stays around" for consecutive math functions. A get/put model would work quite well for such a math library; since it's based on BLAS like almost all such math libraries, I expect this to be the common pattern.
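To make the get/put idea concrete, here is a rough user-space model of the bookkeeping. It is only a sketch under my own assumptions. The names (xstate_get, xstate_put, xstate_enabled, FEAT_AVX512, FEAT_AMX) and the bit layout are made up for illustration and are not an existing kernel or libc interface. A real version would be a syscall or prctl that also handles permission checks and buffer allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature numbers; not a real kernel ABI. */
enum { FEAT_AVX512 = 0, FEAT_AMX = 1, NFEATURES = 2 };

#define FEAT_BIT(f) (UINT64_C(1) << (f))
#define FEAT_ALL    (FEAT_BIT(NFEATURES) - 1)

/* AVX512 is on by default for compat, so its count starts at 1. */
static unsigned int refcount[NFEATURES] = { [FEAT_AVX512] = 1 };

/* A feature counts as enabled while at least one reference is held. */
int xstate_enabled(unsigned int f)
{
    assert(f < NFEATURES);
    return refcount[f] > 0;
}

/*
 * Take one reference on every feature named in the bitmask.
 * Returns 0 on success, -1 if the mask contains unknown bits.
 * A real implementation would allocate the state buffer on the
 * 0 -> 1 transition and could fail there as well.
 */
int xstate_get(uint64_t mask)
{
    if (mask & ~FEAT_ALL)
        return -1;
    for (unsigned int f = 0; f < NFEATURES; f++)
        if (mask & FEAT_BIT(f))
            refcount[f]++;
    return 0;
}

/*
 * Drop one reference on every feature named in the bitmask.
 * Validates first so the operation is all-or-nothing: returns -1
 * on unknown bits or on a put without a matching get.
 * Cleanup on the 1 -> 0 transition may be arbitrarily lazy.
 */
int xstate_put(uint64_t mask)
{
    if (mask & ~FEAT_ALL)
        return -1;
    for (unsigned int f = 0; f < NFEATURES; f++)
        if ((mask & FEAT_BIT(f)) && refcount[f] == 0)
            return -1;
    for (unsigned int f = 0; f < NFEATURES; f++)
        if (mask & FEAT_BIT(f))
            refcount[f]--;
    return 0;
}
```

In this model a library like OpenBLAS would call xstate_get(FEAT_BIT(FEAT_AMX)) from its one-time initialization and xstate_put() with the same mask at teardown, and nested users just bump the count. A process that never wants AVX512 could do a single put on that bit at startup to drop the default reference.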