Re: [PATCH] x86/mm: Disable preemption during CR3 read+write

Greg KH <greg@xxxxxxxxx> · Wed, 18 Oct 2017 18:51:39 +0200

On Wed, Oct 18, 2017 at 05:01:57PM +0200, Bernhard Kaindl wrote:
> Hi Greg!
> 
> On 17.10.2017 17:57, Sebastian Andrzej Siewior wrote:
> > Upstream commit 5cf0791da5c162ebc14b01eb01631cfa7ed4fa6e
> 
> This race happens when a process exit is preempted at the wrong time, and
> we can confirm that the bug fixed by this commit occurs on Linux-3.18 systems.
> 
> This is how we are affected without this fix which is missing in 3.18.x:
> RT Process fails to make progress -> HW watchdog fires -> System resets.
> 
> We were able to see the commit's sequence of events and the CR3 having the
> PGD of a dead process using ftrace + SW watchdog.
> 
> Systems with SCHED_FIFO tasks are especially vulnerable, because when
> the SCHED_FIFO task with the highest priority gets the deceased PGD@CR3, the
> task will page fault forever without making progress and no other process
> can be scheduled anymore.
> 
> If this SCHED_FIFO task is triggering an HW watchdog, the HW watchdog
> will fire; if not, the system will still answer pings but do nothing else.
> 
> With kernel.sched_rt_runtime_us > 0, SCHED_OTHER processes could cause
> a context switch after kernel.sched_rt_period_us expires, so usually
> this would allow the system to recover, because then CR3 would be switched,
> but this is too late, and a real-time system would have failed
> at this point already.
> 
> With kernel.sched_rt_runtime_us < 0, the only recovery in this case is a HW
> watchdog resetting the machine, but with devastating loss of function until
> the system is up again.
> 
> All UP preemptible-kernel x86 real-time systems, including industrial
> control/automation, SCADA, Linux-based PLCs (e.g. using Intel Quark),
> are definitely affected when process termination collides with HW/SW
> interrupts.
> 
> Non-real time systems: Except for some threads occasionally failing to make
> progress, the system will recover:
> Other processes will eventually be scheduled, causing CR3 to be loaded again
> correctly from task->mm->pgd, resolving the problem.
> 
> > This patch is already part of various stable trees but is missing in the
> >    v3.18
> >    v4.1
> 
> Yes, the long-term branches of 3.2, 3.10, 3.16 and 4.4 have got the fix
> (a long time ago!); 4.9 already has it from mainline.
> 
> > tree and applies cleanly on top of
> >    v3.18.69
> >    v4.1.43
> > 
> > I've been contacted by Bernhard Kaindl (Cc:) and he asked about the
> > whereabouts of the patch in the two stable trees. He can confirm that
> > this patch cures his problem on the v3.18 stable tree he is using.
> > He assumes that the same problem might occur on the v4.1 tree and should
> > be fixed by the patch but he has no working setup with v4.1 kernel to
> > confirm this.
> 
> I confirm - here is a quick summary of what we found:
> 
> We saw that our watchdog process got the PGD of a dead process in CR3,
> causing failure to pulse the watchdog because of the page fault loop
> described in the commit log.
> 
> We had a lab of 16 machines available for testing the crash fixed by this
> commit.
> 
> We found this fix by pure luck thanks to Google after a lot of searches by
> several people. With the fix, over this weekend, in the lab, we didn't
> trigger this issue anymore.
> 
> (Actually, we found another issue in our own code and had one unexplained
> machine hang. To debug it we need more specific HW, which we don't have ATM,
> but it is likely that this hang was also caused by our own bug.)
> 
> Before having the fix, we reproduced exactly the sequence of events which
> the commit log describes, within one hour on a single machine.
> 
> With Linux-4.4.64 (which does have this fix), we didn't see this bug.
> 
> Because it appears to fix both 3.18 and 4.4, it makes sense to apply it to
> the v4.1.x longterm branch too.

Thanks for the detailed description, much appreciated.  I've queued it
up for 3.18, it's up to Sasha to do it for 4.1.

thanks again,

greg k-h