On Tue, Jan 26, 2021 at 12:03:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> > And now I've changed it back to non-atomic (but keeping the
> > slightly cleaner looking code style that I used for the atomic
> > version). This one also works for thousands of injections and
> > recoveries. Maybe take it now before it stops working again :-)
>
> Hmm, so the only differences I see between your v4 and this are:
>
> -@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> +@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>
>   static void kill_me_now(struct callback_head *ch)
>   {
> ++        struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
> ++
> +         p->mce_count = 0;
>           force_sig(SIGBUS);
>   }
>
> Could the container_of() macro have changed something?

That change was to fix my brown paper bag moment (the v4 version does
not compile without a variable named "p" in scope to be used on the
next line).

> Because we don't know yet (right?) why would it fail? Would it read
> stale ->mce_count data? If so, then a barrier is missing somewhere.

I don't see how a barrier would make a difference. In the common case
all this code is executed on the same logical CPU. The return from
do_machine_check() tries to return to user mode and finds that there
is some "task_work" to execute first. In some cases Linux might
context switch to something else. Perhaps this task even gets picked
up by another CPU to run the queued task_work functions. But I imagine
that the context switch should act as a barrier ... shouldn't it?

> Or what is the failure exactly?

After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared from
the previous iteration ... but all the banks are cleared as soon as we
find that the machine check is recoverable, which is well before
getting to the code I changed.

When the tests were failing, the code was on top of v5.11-rc3. Latest
experiments moved to -rc5. There's just one tracing fix from PeterZ to
mce/core.c between rc3 and rc5:

  737495361d44 ("x86/mce: Remove explicit/superfluous tracing")

which doesn't appear to be a candidate for the problems I saw.

> Because if I take it now without us knowing what the issue is, it will
> start failing somewhere - Murphy's our friend - and then we'll have to
> deal with breaking people's boxes. Not fun.

Fair point.

> The other difference is:
>
> @@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
> -         current->mce_kflags = m->kflags;
> -         current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> -         current->mce_whole_page = whole_page(m);
> ++        int count = ++current->mce_count;
> ++
> +         /* First call, save all the details */
> -+        if (current->mce_count++ == 0) {
> ++        if (count == 1) {
> +                 current->mce_addr = m->addr;
> +                 current->mce_kflags = m->kflags;
> +                 current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>
> Hmm, a local variable and a pre-increment. Can that have an effect somehow?

This is the bit that changed during my detour using an atomic_t
mce_count. I added the local variable to capture the value from
atomic_inc_return() and then used it later, instead of making a bunch
of atomic_read() calls.

I kept it this way because "if (count == 1)" is marginally easier to
read than "if (current->mce_count++ == 0)".

> > +        /* Ten is likley overkill. Don't expect more than two faults before task_work() */
>
> Typo: likely.

Oops. Fixed.

-Tony
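
P.S. For anyone following the thread, here is roughly how those two
hunks read once applied. This is paraphrased from the interdiff quoted
above rather than copied from the series, and the second fragment is
shown out of its surrounding function (it sits in the helper that
queues the task_work), so treat it as a sketch, not the exact patch:

static void kill_me_now(struct callback_head *ch)
{
        /* Recover the task from the callback_head embedded in task_struct */
        struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);

        p->mce_count = 0;
        force_sig(SIGBUS);
}

        /* In the helper that queues the task_work: */
        int count = ++current->mce_count;

        /* First call, save all the details */
        if (count == 1) {
                current->mce_addr = m->addr;
                current->mce_kflags = m->kflags;
                current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
                current->mce_whole_page = whole_page(m);
        }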