On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> And now I've changed it back to non-atomic (but keeping the
> slightly cleaner looking code style that I used for the atomic
> version).  This one also works for thousands of injections and
> recoveries.  Maybe take it now before it stops working again :-)

Hmm, so the only differences I see between your v4 and this are:

-@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
+@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
  static void kill_me_now(struct callback_head *ch)
  {
++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
++
 +	p->mce_count = 0;
  	force_sig(SIGBUS);
  }

Could the container_of() macro have changed something?

Because we don't know yet (right?) why it would fail. Would it read
stale ->mce_count data? If so, then a barrier is missing somewhere.

Or what is the failure exactly?

Because if I take it now without us knowing what the issue is, it will
start failing somewhere - Murphy's our friend - and then we'll have to
deal with breaking people's boxes. Not fun.

The other difference is:

@@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
 -	current->mce_kflags = m->kflags;
 -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
 -	current->mce_whole_page = whole_page(m);
++	int count = ++current->mce_count;
++
 +	/* First call, save all the details */
-+	if (current->mce_count++ == 0) {
++	if (count == 1) {
 +		current->mce_addr = m->addr;
 +		current->mce_kflags = m->kflags;
 +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);

Hmm, a local variable and a pre-increment. Can that have an effect somehow?

> +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */

Typo: likely.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
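
On the local variable and pre-increment question: taken in isolation, the
two forms of the first-call test should behave the same. Below is a minimal
userspace sketch of that, assuming a single thread of execution and nothing
interrupting between the increment and the comparison. struct fake_task and
the helper names are made up for illustration; none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one task_struct field under discussion. */
struct fake_task {
	int mce_count;
};

/* v4 form: post-increment, compare the old value against 0. */
static bool first_call_postinc(struct fake_task *t)
{
	return t->mce_count++ == 0;
}

/* Newer form: pre-increment into a local, compare the new value against 1. */
static bool first_call_preinc(struct fake_task *t)
{
	int count = ++t->mce_count;

	return count == 1;
}

/*
 * Drive both forms through the same number of calls and assert that they
 * agree on every call and leave the counter in the same state.
 */
static void check_forms_agree(int calls)
{
	struct fake_task a = { .mce_count = 0 };
	struct fake_task b = { .mce_count = 0 };

	for (int i = 0; i < calls; i++) {
		bool first_a = first_call_postinc(&a);
		bool first_b = first_call_preinc(&b);

		assert(first_a == first_b);
		assert(a.mce_count == b.mce_count);
	}
}
```

This only shows that the C semantics of the two spellings match. It says
nothing about concurrent or re-entrant updates of ->mce_count, nor about
the code the compiler actually emits for either version.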