Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

Borislav Petkov <bp@xxxxxxxxx> · Thu, 28 Jan 2021 18:57:35 +0100

On Tue, Jan 26, 2021 at 02:36:05PM -0800, Luck, Tony wrote:
> In some cases Linux might context switch to something else. Perhaps
> this task even gets picked up by another CPU to run the task work
> queued functions.  But I imagine that the context switch should act
> as a barrier ... shouldn't it?

I'm given to understand that the #MC from user is likely to schedule and
a context switch has a barrier character.

> After a few cycles of the test injection to user mode, I saw an
> overflow in the machine check bank. As if it hadn't been cleared
> from the previous iteration ...

This sounds weird. As if something else is happening which we haven't
thought of yet...

> When the tests were failing, code was on top of v5.11-rc3. Latest
> experiments moved to -rc5.  There's just a tracing fix from
> PeterZ between rc3 and rc5 to mce/core.c:
> 
> 737495361d44 ("x86/mce: Remove explicit/superfluous tracing")
> 
> which doesn't appear to be a candidate for the problems I saw.

Doesn't look like it.

> This is the bit that changed during my detour using atomic_t mce_count.
> I added the local variable to capture value from atomic_inc_return(), then
> used it later, instead of a bunch of atomic_read() calls.
> 
> I kept it this way because "if (count == 1)" is marginally easier to read
> than "if (current->mce_count++ == 0)"

Right.

So still no explanation why it would fail before. ;-\

Crazy idea: if you still can reproduce on -rc3, you could bisect: i.e.,
if you apply the patch on -rc3 and it explodes and if you apply the same
patch on -rc5 and it works, then that could be a start... Yeah, don't
have a better idea here. :-\

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette