On Thu, Jul 6, 2023 at 10:12 PM Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx> wrote:
>
> > From: Luck, Tony <tony.luck@xxxxxxxxx>
> > Sent: Wednesday, July 5, 2023 11:22 PM
> > ...
> > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> >
> > >> # head /proc/cpuinfo
> >
> > This shows your system is the workstation version of Sapphire rapids. I don't
> > think we did any validation of the EDAC driver against this model.
>
> No, we didn't do any validation of the EDAC on Sapphires Rapids workstations.
> From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
> https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
>
> We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before.
>
> > > # dmidecode -t 17
> >
> > You have just one 16GB DIMM, and EDAC found that. So despite the messy
> > warnings, EDAC should be working for you.
> >
> > > # lspci
> >
> > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> > rapids?
> > Maybe it has some clues so the EDAC driver will know not to look for
> > non-existent memory controllers.
>
> This Sapphire Rapids workstation had 2 memory controllers but appeared
> 4 memory controller PCIe devices from the log:
>
>   0000:fe:0c.0 1101: 8086:324a
>   0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>   0000:fe:0e.0 1101: 8086:324a
>   0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>
> By observing that the MMIO registers of these absent
> memory controllers consistently hold the value of ~0.
> We may identify a memory controller as absent by checking
> if its MMIO register "mcmtr" == ~0 in all its channels.
>
> I made a patch below to skip all these absent memory controllers
> https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@xxxxxxxxx/T/#u
> @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!

Here is the dmesg output with the patch applied (Ref. 1).
The previous message no longer appears:
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`

Ref. 1, https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmVlp?usp=sharing

>
> BTW,
> Kai-Heng Feng also found the same issue before:
> https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@xxxxxxxxxxxxxx/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
>
> - Qiuxu
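
For reference, the check described in the quoted message (treat a memory controller as absent when its "mcmtr" register reads ~0 in every channel) could be sketched in plain userspace C roughly as below. This is only an illustration under stated assumptions, not the code from the linked patch: mc_is_absent() and MCMTR_ALL_ONES are hypothetical names, and the per-channel register values are passed in as an array instead of being read from MMIO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an mcmtr read from a controller that is not there returns all ones (~0). */
#define MCMTR_ALL_ONES UINT32_MAX

/*
 * Hypothetical helper: returns true only if every channel's mcmtr value
 * is ~0. A single channel with any other value means the controller is
 * treated as present. With nchan == 0 the loop body never runs and the
 * result is (vacuously) true.
 */
static bool mc_is_absent(const uint32_t *mcmtr, size_t nchan)
{
	for (size_t ch = 0; ch < nchan; ch++) {
		if (mcmtr[ch] != MCMTR_ALL_ONES)
			return false;
	}
	return true;
}
```

A caller would skip any controller for which mc_is_absent() returns true. Checking all channels rather than just one avoids dropping a real controller that merely has an unpopulated channel, assuming such a channel can also read back ~0.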