Re: [PATCH v2] EDAC/i10nm: shift exponent is negative

On Thu, Jul 6, 2023 at 10:12 PM Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx> wrote:
>
> > From: Luck, Tony <tony.luck@xxxxxxxxx>
> > Sent: Wednesday, July 5, 2023 11:22 PM
> > ...
> > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> >
> > >> # head /proc/cpuinfo
> >
> > This shows your system is the workstation version of Sapphire Rapids. I don't
> > think we did any validation of the EDAC driver against this model.
>
> No, we didn't do any validation of the EDAC driver on Sapphire Rapids workstations.
> According to the link below, this is a Sapphire Rapids workstation with only 2 memory controllers.
> https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
>
> Previously, we only validated on Sapphire Rapids servers, which have 4 memory controllers per socket.
>
> > > # dmidecode -t 17
> >
> > You have just one 16GB DIMM, and EDAC found that. So despite the messy
> > warnings, EDAC should be working for you.
> >
> > > # lspci
> >
> > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> > Rapids?
> > Maybe it has some clues so the EDAC driver will know not to look for non-
> > existent memory controllers.
>
> This Sapphire Rapids workstation has 2 memory controllers, but 4 memory
> controller PCIe devices appear in the log:
>
>     0000:fe:0c.0 1101: 8086:324a
>     0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>     0000:fe:0e.0 1101: 8086:324a
>     0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>
> We observed that the MMIO registers of these absent
> memory controllers consistently hold the value ~0.
> So we may identify a memory controller as absent by checking
> whether its MMIO register "mcmtr" == ~0 in all its channels.
>
> I made the patch below to skip all these absent memory controllers:
> https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@xxxxxxxxx/T/#u
> @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!
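
For readers following the thread, the check described in the quoted text
(treat a memory controller as absent when "mcmtr" reads ~0 in all of its
channels) can be sketched in plain C as below. This is only an illustration,
not the code from the linked patch: struct fake_imc, imc_absent(),
count_present() and NUM_CHANNELS are hypothetical names, the channel count
is an assumption, and the register values are supplied by the caller rather
than read from real MMIO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: channels per memory controller. */
#define NUM_CHANNELS 2

/* Hypothetical stand-in holding one "mcmtr" value per channel. */
struct fake_imc {
	uint32_t mcmtr[NUM_CHANNELS];
};

/*
 * A controller is treated as absent only if "mcmtr" reads all ones (~0)
 * in every channel; a single channel with another value means present.
 */
static bool imc_absent(const struct fake_imc *imc)
{
	for (int ch = 0; ch < NUM_CHANNELS; ch++) {
		if (imc->mcmtr[ch] != UINT32_MAX)
			return false;
	}
	return true;
}

/* Count the controllers that survive the absence check. */
static size_t count_present(const struct fake_imc *imcs, size_t n)
{
	size_t present = 0;

	for (size_t i = 0; i < n; i++) {
		if (!imc_absent(&imcs[i]))
			present++;
	}
	return present;
}
```

Applied to four controller entries where the second and fourth read ~0 in
every channel, as in the device list quoted above, count_present() would
report 2.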

Here's the dmesg output with the patch applied (Ref. 1). The previous message,
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`, no longer appears.

Ref. 1, https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmVlp?usp=sharing

>
> BTW,
> Kai-Heng Feng also found the same issue before:
> https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@xxxxxxxxxxxxxx/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
>
> - Qiuxu



