On Tue, Jan 26, 2021 at 02:36:05PM -0800, Luck, Tony wrote:
> In some cases Linux might context switch to something else. Perhaps
> this task even gets picked up by another CPU to run the task work
> queued functions. But I imagine that the context switch should act
> as a barrier ... shouldn't it?

I'm given to understand that the #MC from user is likely to schedule and
a context switch has a barrier character.

> After a few cycles of the test injection to user mode, I saw an
> overflow in the machine check bank. As if it hadn't been cleared
> from the previous iteration ...

This sounds weird. As if something else is happening which we haven't
thought of yet...

> When the tests were failing, code was on top of v5.11-rc3. Latest
> experiments moved to -rc5. There's just a tracing fix from
> PeterZ between rc3 and rc5 to mce/core.c:
>
> 737495361d44 ("x86/mce: Remove explicit/superfluous tracing")
>
> which doesn't appear to be a candidate for the problems I saw.

Doesn't look like it.

> This is the bit that changed during my detour using atomic_t mce_count.
> I added the local variable to capture value from atomic_inc_return(), then
> used it later, instead of a bunch of atomic_read() calls.
>
> I kept it this way because "if (count == 1)" is marginally easier to read
> than "if (current->mce_count++ == 0)"

Right.

So still no explanation why it would fail before. ;-\

Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
if you apply the patch on -rc3 and it explodes and if you apply the same
patch on -rc5 and it works, then that could be a start... Yeah, don't
have a better idea here. :-\

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette