Re: [GIT PULL 1/7] soc/tegra: Changes for v5.20-rc1

Sumit Gupta <sumitg@xxxxxxxxxx> · Fri, 15 Jul 2022 13:36:16 +0530

Hi Arnd, Boris,

Thank you for your inputs.

I think this is just a reflection of what other hardware can do:
most machines only detect memory errors, but the EDAC subsystem
can work with any type in principle. There are also a lot of
conditions elsewhere that can be detected but not corrected.

Just a couple of thoughts from looking at this:

So the EDAC thing reports *hardware* errors by using the RAS
capabilities built into an IP block. So it started with memory
controllers but it is getting extended to other blocks. AMD are looking
at how to integrate GPU hw error reporting into it, for example.

Looking at that CBB thing, it looks like it is supposed to report not
so much hardware errors but operational errors. Some of the hw errors
reported by RAS hw are also operation-related but not the majority.

The CBB driver reports errors caused by bad MMIO accesses made by software.
The vast majority of CBB errors tend to be programming errors in
setting up address windows, which lead to decode errors.

Then, EDAC has these counters exposed in:

$ grep -r . /sys/devices/system/edac/
/sys/devices/system/edac/power/runtime_active_time:0
/sys/devices/system/edac/power/runtime_status:unsupported
/sys/devices/system/edac/power/runtime_suspended_time:0
/sys/devices/system/edac/power/control:auto
/sys/devices/system/edac/pci/edac_pci_log_pe:1
/sys/devices/system/edac/pci/pci0/pe_count:0
/sys/devices/system/edac/pci/pci0/npe_count:0
/sys/devices/system/edac/pci/pci_parity_count:0
/sys/devices/system/edac/pci/pci_nonparity_count:0
/sys/devices/system/edac/pci/edac_pci_log_npe:1
/sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
/sys/devices/system/edac/pci/check_pci_errors:0
/sys/devices/system/edac/mc/power/runtime_active_time:0
/sys/devices/system/edac/mc/power/runtime_status:unsupported
...

with the respective hierarchy: memory controllers, PCI errors, etc.
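
To make that hierarchy explicit, the paths in the listing can be grouped by
their first component under the EDAC root. This is only a rough sketch in
Python for illustration: group_edac_counters() and the sample list are
hypothetical names, not part of any kernel or EDAC tooling, and the input is
assumed to be "path:value" lines as printed by grep -r above.

```python
from collections import defaultdict

EDAC_ROOT = "/sys/devices/system/edac/"


def group_edac_counters(lines, root=EDAC_ROOT):
    """Group `grep -r .` style lines ("path:value") by top-level EDAC node."""
    groups = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line.startswith(root):
            continue  # skip "..." and anything outside the EDAC tree
        # Split at the first colon: sysfs path on the left, value on the right.
        path, _, value = line[len(root):].partition(":")
        # First path component is the node (e.g. "mc", "pci", "power").
        node, _, attr = path.partition("/")
        groups[node][attr] = value
    return dict(groups)


# A few lines copied from the listing above (assumed input format).
sample = [
    "/sys/devices/system/edac/pci/pci0/pe_count:0",
    "/sys/devices/system/edac/pci/pci0/npe_count:0",
    "/sys/devices/system/edac/pci/edac_pci_log_pe:1",
    "/sys/devices/system/edac/mc/power/runtime_status:unsupported",
    "...",
]
grouped = group_edac_counters(sample)
```

With the sample above, the "pci" group holds the PCI parity counters and the
"mc" group holds the memory controller attributes, so a CBB node would have to
appear as a sibling of those if it were fitted into EDAC.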

So the main question is, does it make sense for you to fit this into the
EDAC hierarchy and what would even be the advantage of making it part of
EDAC?

I also think this does not fit with the errors reported by EDAC,
which are mainly hardware errors, as Boris explained.
Please share your thoughts and let me know whether we can merge the
patches as they are.

HTH.

--
Regards/Gruss,
     Boris.

https://people.kernel.org/tglx/notes-about-netiquette