Please ignore the first question. I now see the expected EDAC message in
the kernel log:

EDAC MC0: 1 CE fsl_mc_err on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x5df1f offset:0xe40 grain:8 syndrome:0xe0e0)

1) Is there anything similar to the edac-utils but for ARM instead of
x86, or does sysfs replace the edac-utils, or is there something else
for ARM?

2) What is currently used for collecting and reporting ECC errors for
ARM/EDAC beyond the kernel log and messages?
https://github.com/grondo/edac-utils

3) How is RAS/rasdaemon reporting integrated into EDAC for error
collection and reporting?

4) Has there been a patch to prevent EDAC sysfs API from reporting bogus
values?
See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html

On Wed, Nov 21, 2018 at 11:01 AM Tracy Smith <tlsmith3777@xxxxxxxxx> wrote:
>
> Not probing the edac driver turned out to be a device tree issue as
> Steve suspected. Thanks to both Steve and York, this has been resolved
> and the backport is now logging ECC errors after injection. Added the
> ddr qoriq-memory-controller entry since we used a different .dtsi
> file.
>
> arch/arm64/boot/dts/freescale/...ls1043a.dtsi
>
> ddr: memory-controller@1080000 {
>         compatible = "fsl,qoriq-memory-controller";
>         reg = <0x0 0x1080000 0x0 0x1000>;
>         interrupts = <0 144 0x4>;
>         big-endian;
> };
>
> I now need to collect and report CE and UE ECC errors and extend the
> existing logging and reporting function that I currently see. After
> reviewing the following document, the system logging appears different
> from that given in the kernel EDAC document. I need the level of
> granularity described in the edac.txt file.
>
> https://www.mjmwired.net/kernel/Documentation/edac.txt#173 same as
> kernel/Documentation/edac.txt
>
> 1) Can I gather the system logging described below in the edac.txt
> file for layerscape?
>
> 2) Is there anything similar to the edac-utils but for ARM, or does
> sysfs replace the edac-utils, or something else?
>
> 3) What is currently used for collecting and reporting ECC errors for
> ARM/EDAC beyond the kernel log and messages?
> https://github.com/grondo/edac-utils
>
> 4) How is RAS reporting integrated into EDAC for error collection and reporting?
>
> 5) Has there been a patch to prevent EDAC sysfs API from reporting bogus values?
> See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html
>
> - The EDAC sysfs API will still report bogus values. So, userspace
> tools like edac-utils will still use the bogus data;
>
> - Add a new tracepoint-based way to get the binary information about
> the errors.
>
> This is the logging I currently see with layerscape EDAC. Need
> something explaining these fields.
>
> [ 407.612311] EDAC FSL_DDR MC0: Err Detect Register: 0x80000004
> [ 407.618182] EDAC FSL_DDR MC0: Faulty Data bit: 0
> [ 407.622793] EDAC FSL_DDR MC0: Expected Data / ECC: 0x40c50901_40c50900 / 0x800000f0
> [ 407.630443] EDAC FSL_DDR MC0: Captured Data / ECC: 0x40c50900_40c50901 / 0xf0
> [ 407.637571] EDAC FSL_DDR MC0: Err addr: 0x3e0bfff50
> [ 407.642440] EDAC FSL_DDR MC0: PFN: 0x003e0bff
>
> This is the level of detail I need:
>
> SYSTEM LOGGING
> --------------
>
> If logging for UEs and CEs is enabled, then system logs will contain
> information indicating that errors have been detected:
>
> EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
> channel 1 "DIMM_B1": amd76x_edac
>
> EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
> channel 1 "DIMM_B1": amd76x_edac
>
> The structure of the message is:
>     the memory controller             (MC0)
>     Error type                        (CE)
>     memory page                       (0x283)
>     offset in the page                (0xce0)
>     the byte granularity              (grain 8)
>         or resolution of the error
>     the error syndrome                (0xb741)
>     memory row                        (row 0)
>     memory channel                    (channel 1)
>     DIMM label, if set prior          (DIMM B1)
>     and then an optional, driver-specific message that may
>         have additional information.
>
> Both UEs and CEs with no info will lack all but memory controller, error
> type, a notice of "no info" and then an optional, driver-specific error
> message.
>
> On Mon, Nov 19, 2018 at 10:48 AM York Sun <york.sun@xxxxxxx> wrote:
> >
> > On 11/19/18 8:38 AM, Tracy Smith wrote:
> > > Steve, you were correct, there wasn't a device tree entry for the
> > > qoriq memory controller in
> > > arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi. I added it making it
> > > identical to the fsl-ls1046s.dtsi, which should have the same memory
> > > controller and entry as the ls1043a. I added this but it didn't make
> > > a difference as far as being able to call the probe function. I'm now
> > > checking the mpc85xx_edac.c dtsi entry for comparison since York used
> > > the mpc85xx as the basis for the layerscape, but there is something
> > > else missing preventing the probe function from being called.
> > >
> > > @York
> > > What is your entry for
> > > /proc/device-tree/soc/ifc@1530000/board-control@1,0/compatible
> >
> > EDAC driver doesn't check IFC. Are you debugging EDAC for memory controller?
> >
> > >
> > > @York
> > > cat /proc/device-tree/compatible entry is this, is this correct?
> > > fsl,ls1043a-rdbfsl,ls1043a
> >
> > Once again, you are using your modified code on your own board. So it is
> > not ls1043ardb. This compatible has nothing to do with EDAC driver.
> >
> > I cannot help you with ls1043ardb because the real ls1043ardb board
> > doesn't support ECC. The closest board I have is ls1046ardb.
> >
> > >
> > > ddr: memory-controller@1080000 {
> > >         compatible = "fsl,qoriq-memory-controller";
> > >         reg = <0x0 0x1080000 0x0 0x1000>;
> > >         interrupts = <0 144 0x4>;
> > >         big-endian;
> > > };
> >
> > This is your source code, not your final device tree. Please learn to
> > use "fdt" command under U-Boot to dump your device tree before booting
> > Linux, or check after Linux is up.
> > For your reference, on my ls1046ardb, I have
> >
> > # cat /proc/device-tree/soc/memory-controller@1080000/compatible
> > fsl,qoriq-memory-controller
> >
> > York
> >
>
> --
> Confidentiality notice: This e-mail message, including any
> attachments, may contain legally privileged and/or confidential
> information. If you are not the intended recipient(s), please
> immediately notify the sender and delete this e-mail message.
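
For anyone who wants to pull the individual fields out of the CE line
quoted at the top of this message, here is a minimal Python sketch. It
assumes the line has exactly the layout shown in this thread (count,
CE/UE, driver message, " on ", label, then key:value pairs in
parentheses); other kernel versions or drivers may format it
differently. The names EDAC_RE and parse_edac_line are made up for this
example and are not part of edac-utils or rasdaemon.

```python
import re

# Matches the message layout seen in this thread, e.g.
#   EDAC MC0: 1 CE fsl_mc_err on mc#0csrow#0channel#0 (csrow:0 ... syndrome:0xe0e0)
# This layout is an assumption taken from that single log line.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<type>CE|UE) "
    r"(?P<msg>.*?) on (?P<label>\S+) \((?P<details>[^)]*)\)"
)


def parse_edac_line(line):
    """Return a dict of fields from one EDAC log line, or None if no match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    fields = {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "type": m.group("type"),
        "msg": m.group("msg"),
        "label": m.group("label"),
    }
    # The parenthesised part is a list of key:value pairs separated by spaces.
    for item in m.group("details").split():
        key, sep, value = item.partition(":")
        if not sep:
            continue
        try:
            # Base 0 accepts both decimal ("8") and hex ("0xe40") values.
            fields[key] = int(value, 0)
        except ValueError:
            # Keep anything that is not a plain number as text.
            fields[key] = value
    return fields


if __name__ == "__main__":
    sample = (
        "EDAC MC0: 1 CE fsl_mc_err on mc#0csrow#0channel#0 "
        "(csrow:0 channel:0 page:0x5df1f offset:0xe40 grain:8 syndrome:0xe0e0)"
    )
    print(parse_edac_line(sample))
```

The regex uses search() rather than match(), so a leading dmesg
timestamp in front of "EDAC" does not prevent a match.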