> -----Original Mail-----
> Sender: Raj, Ashok <ashok.raj@xxxxxxxxx>
> Time: 2019.05.31 1:11
> To : Tony W Wang-oc <TonyWWang-oc@xxxxxxxxxxx>
> CC: tipbot@xxxxxxxxx; bp@xxxxxxx; hpa@xxxxxxxxx;
> linux-edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> linux-tip-commits@xxxxxxxxxxxxxxx; mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx;
> stable@xxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx; tony.luck@xxxxxxxxx;
> torvalds@xxxxxxxxxxxxxxxxxxxx; David Wang <DavidWang@xxxxxxxxxxx>; Ashok
> Raj <ashok.raj@xxxxxxxxx>
> Topic: Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't
> participate in rendezvous process
>
> On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok,
> > > I have two questions about this patch, could you help to check:
> > >
> > > 1, for broadcast #MC exceptions, this patch seems require #MC
> > > exception errors set MCG_STATUS_RIPV = 1.
> > > But for Intel CPU, some #MC exception errors set MCG_STATUS_RIPV = 0
> > > (like "Recoverable-not-continuable SRAR Type" Errors), for these
> > > errors the patch doesn't seem to work, is that okay?
> > >
> > > 2, for LMCE exceptions, this patch seems require #MC exception
> > > errors set MCG_STATUS_RIPV = 0 to make sure LMCE be handled normally
> > > even on offline CPU.
> > > For LMCE errors set MCG_STATUS_RIPV = 1, the patch prevents offline
> > > CPU handle these LMCE errors, is that okay?
> > >
> >
> > More specifically, this patch seems require #MC exceptions meet the
> > condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1"; But on a Xeon
> > X5650 machine (SMP),
>
> The offline CPU will never get a LMCE=1, since those only happen on the CPU
> that's doing active work. Offline CPUs just sitting in idle.

So, for Intel CPUs, is LMCE only for thread-level (or core-level) errors?
If not, suppose two threads share a level-2 cache, where thread 0 is active
and thread 1 was offlined by software. When an MCE occurs for this level-2
cache, thread 1 will become active.
When thread 1 reads mcgstatus.lmce, will the result always be 0?

Thanks.

>
> The specific error here is a PCC=1, so irrespective of what happens we do capture
> the errors in the per-cpu log, and kernel would panic.
>
> What specifically this patch tries to achieve is to leave an error sitting with
> MCG-STATUS.MCIP=1 and another recoverable error would shut the system
> down.
>
> I don't see anything wrong with what this patch does..
>
> > "Data CACHE Level-2 Generic Error" does not meet this condition.
> >
> > I got below message from:
> > https://www.centos.org/forums/viewtopic.php?p=292742
> >
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 4 BANK 6 TSC b7065eeaa18b0
> > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > MCG status:MCIP
> > MCi status:
> > Uncorrected error
> > Error enabled
> > Processor context corrupt
> > MCA: Data CACHE Level-2 Generic Error
> > STATUS b200000080000106 MCGSTATUS 4
> > MCGCAP 1c09 APICID 4 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44
> >
> > > Thanks
> > > Tony W Wang-oc
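
For reference, the register values in the quoted mcelog output can be checked by hand against the "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1" condition discussed above. The following is only a rough sketch, not kernel code: the bit positions are taken from the Intel SDM layout of IA32_MCG_STATUS and IA32_MCi_STATUS as I understand it, and the helper names `decode` and `ripv_xor_lmces` are hypothetical, made up for this illustration.

```python
# Bit masks for IA32_MCG_STATUS (assumed from the Intel SDM layout).
MCG_STATUS_RIPV = 1 << 0    # restart IP valid
MCG_STATUS_EIPV = 1 << 1    # error IP valid
MCG_STATUS_MCIP = 1 << 2    # machine check in progress
MCG_STATUS_LMCES = 1 << 3   # local machine check exception signaled

# Selected bit masks for IA32_MCi_STATUS (assumed from the Intel SDM layout).
MCI_STATUS_VAL = 1 << 63    # register contents valid
MCI_STATUS_UC = 1 << 61     # uncorrected error
MCI_STATUS_EN = 1 << 60     # error reporting enabled
MCI_STATUS_PCC = 1 << 57    # processor context corrupt

MCG_FLAGS = [
    ("RIPV", MCG_STATUS_RIPV),
    ("EIPV", MCG_STATUS_EIPV),
    ("MCIP", MCG_STATUS_MCIP),
    ("LMCES", MCG_STATUS_LMCES),
]

MCI_FLAGS = [
    ("VAL", MCI_STATUS_VAL),
    ("UC", MCI_STATUS_UC),
    ("EN", MCI_STATUS_EN),
    ("PCC", MCI_STATUS_PCC),
]


def decode(value, flags):
    """Hypothetical helper: return the names of all flags set in value."""
    return [name for name, mask in flags if value & mask]


def ripv_xor_lmces(mcgstatus):
    """Hypothetical helper: evaluate the RIPV ^ LMCES condition from the thread."""
    ripv = bool(mcgstatus & MCG_STATUS_RIPV)
    lmces = bool(mcgstatus & MCG_STATUS_LMCES)
    return ripv != lmces


if __name__ == "__main__":
    mcgstatus = 0x4                  # "MCGSTATUS 4" from the quoted log
    status = 0xB200000080000106      # "STATUS b200000080000106" from the quoted log
    print("MCG_STATUS flags:", decode(mcgstatus, MCG_FLAGS))
    print("MCi_STATUS flags:", decode(status, MCI_FLAGS))
    print("RIPV ^ LMCES == 1:", ripv_xor_lmces(mcgstatus))
```

With MCGSTATUS 4 only MCIP is set, so RIPV and LMCES are both 0 and the XOR condition does not hold for this log entry, which matches the "MCG status:MCIP" line and the PCC=1 status that mcelog reports.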