Re: Reply: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > Hi Ashok,
> > I have two questions about this patch; could you help check them?
> > 
> > 1. For broadcast #MC exceptions, this patch seems to require that the
> > errors set MCG_STATUS_RIPV = 1.
> > But on Intel CPUs, some #MC errors set MCG_STATUS_RIPV = 0
> > (like "Recoverable-not-continuable SRAR Type" errors). For these errors
> > the patch doesn't seem to work; is that okay?
> > 
> > 2. For LMCE exceptions, this patch seems to require that the errors
> > set MCG_STATUS_RIPV = 0 to make sure the LMCE is handled normally even
> > on an offline CPU.
> > For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents the
> > offline CPU from handling them; is that okay?
> > 
> 
> More specifically, this patch seems to require that #MC exceptions meet the
> condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1"; but on a Xeon X5650 machine (SMP),

The offline CPU will never get LMCE=1, since those only happen on a CPU
that's doing active work. Offline CPUs are just sitting idle.

The specific error here has PCC=1, so irrespective of what happens
we do capture the error in the per-CPU log, and the kernel would panic.

What this patch specifically tries to avoid is leaving an error
sitting with MCG_STATUS.MCIP=1, where another recoverable error would
shut the system down.

I don't see anything wrong with what this patch does.
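
To make the behaviour under discussion concrete, here is a minimal Python model of the offline-CPU check as this thread describes it. It is only a sketch and not the kernel code: the function name `offline_cpu_check` is hypothetical, and it assumes that the patch clears MCG_STATUS (and with it MCIP) on an offline CPU only when RIPV is set.

```python
# Bit masks for IA32_MCG_STATUS (bit positions per the Intel SDM).
MCG_STATUS_RIPV = 1 << 0   # restart IP valid
MCG_STATUS_EIPV = 1 << 1   # error IP valid
MCG_STATUS_MCIP = 1 << 2   # machine check in progress
MCG_STATUS_LMCES = 1 << 3  # local machine check signaled


def offline_cpu_check(cpu_offline, mcgstatus):
    """Hypothetical model of the check discussed in this thread.

    Returns (skip_rendezvous, new_mcgstatus). Assumption: an offline CPU
    that sees RIPV=1 clears MCG_STATUS and stays out of the rendezvous;
    in every other case the register is left untouched.
    """
    if cpu_offline and (mcgstatus & MCG_STATUS_RIPV):
        return True, 0          # MCIP is cleared along with the rest
    return False, mcgstatus     # falls through to normal handling


# Offline CPU, broadcast error with RIPV=1: MCIP does not stay set.
print(offline_cpu_check(True, MCG_STATUS_RIPV | MCG_STATUS_MCIP))
# Offline CPU, RIPV=0 (the case raised in question 1): not skipped.
print(offline_cpu_check(True, MCG_STATUS_MCIP))
```

In this model the RIPV=0 case is not short-circuited, so the error still reaches the normal handling path, where a PCC=1 error is logged and ends in a panic.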

> "Data CACHE Level-2 Generic Error" does not meet this condition.
> 
> I got the message below from: https://www.centos.org/forums/viewtopic.php?p=292742
> 
> Hardware event. This is not a software error.
> MCE 0
> CPU 4 BANK 6 TSC b7065eeaa18b0 
> TIME 1545643603 Mon Dec 24 10:26:43 2018
> MCG status:MCIP 
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Data CACHE Level-2 Generic Error
> STATUS b200000080000106 MCGSTATUS 4
> MCGCAP 1c09 APICID 4 SOCKETID 0 
> CPUID Vendor Intel Family 6 Model 44
> 
> > Thanks
> > Tony W Wang-oc
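
The two register values in the quoted mcelog record can be checked by hand. The sketch below decodes MCGSTATUS 4 and STATUS b200000080000106 using the flag bit positions from the Intel SDM and evaluates the RIPV ^ LMCES condition mentioned in the thread. The helper names (`set_flags`, `decode_cache_code`) and the label strings are illustrative assumptions, and only the cache-hierarchy error code form is handled.

```python
# Flag bits of IA32_MCG_STATUS and IA32_MCi_STATUS (Intel SDM layout).
MCG_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP", 3: "LMCE_S"}
MCI_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
            59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Fields of a cache-hierarchy MCA error code: 000F 0001 RRRR TTLL.
CACHE_TT = {0: "Instruction", 1: "Data", 2: "Generic"}
CACHE_LL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def set_flags(value, names):
    """Return the names of the bits set in value, highest bit first."""
    return [names[bit] for bit in sorted(names, reverse=True)
            if (value >> bit) & 1]


def decode_cache_code(status):
    """Return (transaction type, level) if the low 16 bits hold a
    cache-hierarchy error code, else None."""
    code = status & 0xFFFF
    if (code & 0xEF00) != 0x0100:   # not of the form 000F 0001 ...
        return None
    return CACHE_TT.get((code >> 2) & 3), CACHE_LL[code & 3]


mcgstatus = 0x4                  # "MCGSTATUS 4" from the record
status = 0xb200000080000106      # "STATUS b200000080000106"

mcg_flags = set_flags(mcgstatus, MCG_BITS)
mci_flags = set_flags(status, MCI_BITS)
ripv = mcgstatus & 1
lmces = (mcgstatus >> 3) & 1

print(mcg_flags)                  # ['MCIP']
print(mci_flags)                  # ['VAL', 'UC', 'EN', 'PCC']
print(decode_cache_code(status))  # ('Data', 'Level-2')
print(ripv ^ lmces)               # 0
```

For this record RIPV and LMCE_S are both clear, so the XOR evaluates to 0, which matches the statement that this error does not meet the condition. The UC, EN and PCC flags line up with the "Uncorrected error", "Error enabled" and "Processor context corrupt" lines in the record.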


