On 5/8/24 11:03, Chang S. Bae wrote:
> On 5/8/2024 7:40 AM, Dave Hansen wrote:
>> On 5/7/24 16:53, Chang S. Bae wrote:
>>
>>> However, due to resource constraints in storage, AMX state is excluded
>>> from the scope of state recovery. Consequently, AMX state must be in its
>>> initialized state for the IFS test to run.
>>
>> This doesn't mention how this issue got introduced. Are we all bad at
>> reading the SDM? :)
>
> Ah, I'd rather zap out this SDM sentence.

My point is that this is fixing a bug. Where did that bug come from?
What got screwed up here?

Hint: I don't think us software folks screwed up here. It was likely
the folks that built the two hardware features (AMX and IFS) forgot to
talk to each other, or someone forgot to document the AMX clobbering
aspect of the architecture.

>>> When AMX workloads are running, an active user AMX state remains even
>>> after a context switch, optimizing to reduce the state reload cost. In
>>> such cases, the test cannot proceed if it is scheduled.
>>
>> This is a bit out of the blue. What does scheduling have to do with IFS?
...
> So, the CPU stopper threads for <cpu#> and its sibling to execute
> doscan() are queued up with the highest priority.
...

But this is the IFS implementation *today*. The explanation depends on
IFS being implemented with something that context switches. It also
depends on folks expecting context switches to always switch FPU state.

I'd just say:

	The kernel generally runs with live user FPU state, including
	AMX. That state can prevent IFS tests from running.

That's _much_ more simple, generic and also fully explains the
situation. It also isn't dependent on the IFS stop_cpus_run()
implementation of today, which could totally change tomorrow.

The underlying rule has zero to do with scheduling or context switching
optimizations.