Re: edac driver initialization, interrupt, & debug

Please ignore the first question. I now see the expected EDAC message
in the kernel log:

EDAC MC0: 1 CE fsl_mc_err on mc#0csrow#0channel#0 (csrow:0 channel:0
page:0x5df1f offset:0xe40 grain:8 syndrome:0xe0e0)
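
For anyone who wants to pick this line apart programmatically, here is a
minimal sketch in Python. It is not part of edac-utils or rasdaemon. The
function name parse_edac_line and the regular expressions are my own
illustration, inferred only from the single message above, and PAGE_SHIFT = 12
assumes 4 KiB pages.

```python
import re

# Header of an "EDAC MCn: <count> CE|UE <msg> on <label> (<fields>)" line.
# Inferred from the one sample message above, not from a format specification.
HEAD_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<type>CE|UE) "
    r"(?P<msg>.*?) on (?P<label>\S+) \((?P<fields>[^)]*)\)"
)
# key:value pairs inside the parentheses, e.g. "page:0x5df1f" or "grain:8".
FIELD_RE = re.compile(r"(\w+):(0x[0-9a-fA-F]+|\d+)")

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_line(line):
    """Return a dict of the fields in one EDAC MC log line, or None."""
    m = HEAD_RE.search(line)
    if m is None:
        return None
    rec = {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "type": m["type"],
        "msg": m["msg"],
        "label": m["label"],
    }
    for key, val in FIELD_RE.findall(m["fields"]):
        rec[key] = int(val, 16) if val.lower().startswith("0x") else int(val)
    if "page" in rec and "offset" in rec:
        # Rebuild the physical address from page frame number and page offset.
        rec["phys_addr"] = (rec["page"] << PAGE_SHIFT) | rec["offset"]
    return rec


if __name__ == "__main__":
    sample = ("EDAC MC0: 1 CE fsl_mc_err on mc#0csrow#0channel#0 "
              "(csrow:0 channel:0 page:0x5df1f offset:0xe40 grain:8 "
              "syndrome:0xe0e0)")
    print(parse_edac_line(sample))
```

The wrapped log line has to be joined into one string before it is passed in,
as the sample in the __main__ block does.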

1)  Is there anything similar to edac-utils, but for ARM instead of
x86? Does sysfs replace edac-utils, or is there something else for ARM?

2)  What is currently used for collecting and reporting ECC errors for
ARM/EDAC beyond the kernel log and messages?

https://github.com/grondo/edac-utils

3) How is RAS/rasdaemon reporting integrated into EDAC for error
collection and reporting?

4) Has there been a patch to prevent EDAC sysfs API from reporting bogus values?
See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html
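
As a point of reference for questions 1 and 2, here is a minimal sketch of
polling the per-controller counters straight from sysfs. It assumes the
standard EDAC layout, with ce_count and ue_count files under
/sys/devices/system/edac/mc/mcN/. I have not confirmed that the backported
layerscape driver populates them, and the helper name read_counters is only
illustrative.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller sysfs tree.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_counters(root=EDAC_MC_ROOT):
    """Return {"mc0": {"ce_count": n, "ue_count": m}, ...} for each controller.

    A counter that is missing or unreadable is reported as None.
    """
    counters = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counters[mc_dir.name] = entry
    return counters


if __name__ == "__main__":
    for mc, entry in read_counters().items():
        print(mc, entry)
```

This only reads totals. It does not give the per-error detail (page, offset,
syndrome) that the kernel log line carries.
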
On Wed, Nov 21, 2018 at 11:01 AM Tracy Smith <tlsmith3777@xxxxxxxxx> wrote:
>
> Not probing the edac driver turned out to be a device tree issue as
> Steve suspected. Thanks to both Steve and York, this has been resolved
> and the backport is now logging ECC errors after injection. Added the
> ddr qoriq-memory-controller entry since we used a different .dtsi
> file.
>
> arch/arm64/boot/dts/freescale/...ls1043a.dtsi
>
> ddr: memory-controller@1080000 {
>         compatible = "fsl,qoriq-memory-controller";
>         reg = <0x0 0x1080000 0x0 0x1000>;
>         interrupts = <0 144 0x4>;
>         big-endian;
> };
>
> I now need to collect and report CE and UE ECC errors and extend the
> existing logging and reporting function that I currently see. After
> reviewing the following document, the system logging appears different
> from that given in the kernel EDAC document. I need the level of
> granularity described in the edac.txt file.
>
> https://www.mjmwired.net/kernel/Documentation/edac.txt#173 same as
> kernel/Documentation/edac.txt
>
> 1)  Can I gather the system logging described below in the edac.txt
> file for layerscape?
>
> 2)  Is there anything similar to the edac-utils but for ARM, or does
> sysfs replace the edac-utils, or something else?
>
> 3)  What is currently used for collecting and reporting ECC errors for
> ARM/EDAC beyond the kernel log and messages?
> https://github.com/grondo/edac-utils
>
> 4) How is RAS reporting integrated into EDAC for error collection and reporting?
>
> 5) Has there been a patch to prevent EDAC sysfs API from reporting bogus values?
> See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html
>
> - The EDAC sysfs API will still report bogus values. So, userspace
> tools like edac-utils will still use the bogus data;
>
> - Add a new tracepoint-based way to get the binary information about
> the errors.
>
> This is the logging I currently see with layerscape EDAC. Need
> something explaining these fields.
>
> [ 407.612311] EDAC FSL_DDR MC0: Err Detect Register: 0x80000004
> [ 407.618182] EDAC FSL_DDR MC0: Faulty Data bit: 0
> [ 407.622793] EDAC FSL_DDR MC0: Expected Data / ECC: 0x40c50901_40c50900 / 0x800000f0
> [ 407.630443] EDAC FSL_DDR MC0: Captured Data / ECC: 0x40c50900_40c50901 / 0xf0
> [ 407.637571] EDAC FSL_DDR MC0: Err addr: 0x3e0bfff50
> [ 407.642440] EDAC FSL_DDR MC0: PFN: 0x003e0bff
>
> This is the level of detail I need:
>
> SYSTEM LOGGING
> --------------
>
> If logging for UEs and CEs is enabled, then system logs will contain
> information indicating that errors have been detected:
>
> EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
> channel 1 "DIMM_B1": amd76x_edac
>
> EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
> channel 1 "DIMM_B1": amd76x_edac
>
> The structure of the message is:
>     the memory controller          (MC0)
>     error type                     (CE)
>     memory page                    (0x283)
>     offset in the page             (0xce0)
>     the byte granularity           (grain 8)
>         or resolution of the error
>     the error syndrome             (0x6ec3)
>     memory row                     (row 0)
>     memory channel                 (channel 1)
>     DIMM label, if set prior       (DIMM_B1)
>     and then an optional, driver-specific message that may
>         have additional information.
>
> Both UEs and CEs with no info will lack all but memory controller, error
> type, a notice of "no info" and then an optional, driver-specific error
> message.
>
> On Mon, Nov 19, 2018 at 10:48 AM York Sun <york.sun@xxxxxxx> wrote:
> >
> > On 11/19/18 8:38 AM, Tracy Smith wrote:
> > > Steve, you were correct, there wasn't a device tree entry for the
> > > qoriq memory controller in
> > > arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi.  I added it making it
> > > identical to the fsl-ls1046a.dtsi, which should have the same memory
> > > controller and entry as the ls1043a.  I added this but it didn't make
> > > a difference as far as being able to call the probe function. I'm now
> > > checking the mpc85xx_edac.c dtsi entry for comparison since York used
> > > the mpc85xx as the basis for the layerscape, but there is something
> > > else missing preventing the probe function from being called.
> > >
> > > @York
> > > What is your entry for
> > > /proc/device-tree/soc/ifc@1530000/board-control@1,0/compatible
> >
> > EDAC driver doesn't check IFC. Are you debugging EDAC for memory controller?
> >
> > >
> > > @York
> > > The output of cat /proc/device-tree/compatible is below. Is it correct?
> > > fsl,ls1043a-rdbfsl,ls1043a
> >
> > Once again, you are using your modified code on your own board. So it is
> > not ls1043ardb. This compatible has nothing to do with EDAC driver.
> >
> > I cannot help you with ls1043ardb because the real ls1043ardb board
> > doesn't support ECC. The closest board I have is ls1046ardb.
> >
> > >
> > >                 ddr: memory-controller@1080000 {
> > >                          compatible = "fsl,qoriq-memory-controller";
> > >                          reg = <0x0 0x1080000 0x0 0x1000>;
> > >                          interrupts = <0 144 0x4>;
> > >                          big-endian;
> > >                  };
> >
> > This is your source code, not your final device tree. Please learn to
> > use "fdt" command under U-Boot to dump your device tree before booting
> > Linux, or check after Linux is up. For your reference, on my ls1046ardb,
> > I have
> >
> > # cat /proc/device-tree/soc/memory-controller@1080000/compatible
> > fsl,qoriq-memory-controller
> >
> > York
>
>
>
> --
> Confidentiality notice: This e-mail message, including any
> attachments, may contain legally privileged and/or confidential
> information. If you are not the intended recipient(s), please
> immediately notify the sender and delete this e-mail message.





