On Fri, Apr 21, 2017 at 01:19:16PM -0700, Dan Williams wrote:
> On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> >>> > +       if (!(mce->status & 0xef80) == BIT(7))
> >>>
> >>> Can we get a define for this, or a comment explaining all the magic
> >>> that's happening on that one line?
> >>
> >> Yes - also like lkp pointed out, the check isn't correct at all. Let me
> >> figure out what really needs to be done, and I will resend with a better
> >> comment.
> >
> > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> >
> >         if (!((mce->status & 0xef80) == BIT(7)))
> >
> > The magic is shown in table 15-9 of the Intel Software Developers Manual
> > (but perhaps not well explained there).
> >
> > mce->status in the above code is a value plucked from a machine check
> > bank status register. See figure 15-6 in the SDM. The important bits for
> > this are {15:0} which are the "MCA Error code". Table 15-9 shows how these
> > are grouped into types, where the type is defined by the most significant
> > '1' bit in the field (excluding bit 12 which is the Correction Report
> > Filtering bit, see section 15.9.2.1).
> >
> > So if BIT(3) is the most significant bit, then this is a "Generic Cache
> > Hierarchy" error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and
> > so on.
>
> Ah, ok.
>
> > Maybe we should have defines in mce.h for them? It gets a bit more
> > complicated as all the above only applies to Intel branded X86 CPUs ...
> > on AMD different decoding rules apply.
>
> Yeah, this code is x86_64 generic so should call into helpers that do
> the right thing per cpu type.

Boris: you coded up a "static bool memory_error(struct mce *m)" function
inside the patches for the corrected error thingy. Perhaps when it goes
upstream it should be available for other users too?

-Tony
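
For anyone following along, here is a minimal user-space sketch of the check
discussed above. It is not the kernel code and not Boris's memory_error()
helper. The names MCACOD_TYPE_MASK, MCACOD_MEM, intel_mcacod_is_memory_error(),
misparenthesised_check() and mcacod_self_check() are made up for illustration,
and BIT() is redefined locally so the file builds on its own. It only covers
the Intel decoding rule described above (most significant '1' bit of the MCA
error code, ignoring bit 12).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1ULL << (n))

/*
 * Hypothetical names, not taken from mce.h.
 * 0xef80 selects bits 15:7 of the MCA error code with bit 12 (the
 * Correction Report Filtering bit) masked out.
 */
#define MCACOD_TYPE_MASK 0xef80ULL
#define MCACOD_MEM       BIT(7)

/* True when bit 7 is the most significant '1' bit, ignoring bit 12. */
bool intel_mcacod_is_memory_error(uint64_t status)
{
	return (status & MCACOD_TYPE_MASK) == MCACOD_MEM;
}

/*
 * The test from the original patch. The extra parentheses around the
 * negation only spell out how C already parses it: '!' binds tighter
 * than '==', so the left side is 0 or 1 and never equals BIT(7).
 */
bool misparenthesised_check(uint64_t status)
{
	return (!(status & MCACOD_TYPE_MASK)) == BIT(7);
}

/* A few sanity checks on example status values. */
void mcacod_self_check(void)
{
	assert(intel_mcacod_is_memory_error(0x0080));   /* bit 7 is the top bit */
	assert(intel_mcacod_is_memory_error(0x1080));   /* bit 12 is ignored */
	assert(!intel_mcacod_is_memory_error(0x0010));  /* bit 4 on top: TLB */
	assert(!intel_mcacod_is_memory_error(0x0180));  /* bit 8 outranks bit 7 */
	assert(!misparenthesised_check(0x0080));        /* can never be true */
	assert(!misparenthesised_check(0));
}
```

With the corrected parentheses, the condition in the patch would read as
"if (!intel_mcacod_is_memory_error(mce->status))". A per-vendor helper would
still be needed on top of this, since the rule above does not apply on AMD.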