Re: edac driver initialization, interrupt, & debug

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Not probing the edac driver turned out to be a device tree issue as
Steve suspected. Thanks to both Steve and York, this has been resolved
and the backport is now logging ECC errors after injection. Added the
ddr qoriq-memory-controller entry since we used a different .dtsi
file.

arch/arm64/boot/dts/freescale/...ls1043a.dtsi

ddr: memory-controller@1080000
{ compatible = "fsl,qoriq-memory-controller"; reg = <0x0 0x1080000 0x0
0x1000>; interrupts = <0 144 0x4>; big-endian; };

I now need to collect and report CE and UE ECC errors and extend the
existing logging and reporting function that I currently see. After
reviewing the following document, the system logging appears different
from that given in the kernel EDAC document. I need the level of
granularity described in the edac.txt file.

https://www.mjmwired.net/kernel/Documentation/edac.txt#173 same as
kernel/Documentation/edac.txt

1)  Can I gather the system logging described below in the edac.txt
file for layerscape?

2)  Is there anything similar to the edac-utils but for ARM, or does
sysfs replace the edac-utils, or something else?

3)  What is currently used for collecting and reporting ECC errors for
ARM/EDAC beyond the kernel log and messages?
https://github.com/grondo/edac-utils

4) How is RAS reporting integrated into EDAC for error collection and reporting?

5) Has there been a patch to prevent EDAC sysfs API from reporting bogus values?
See http://lkml.iu.edu/hypermail/linux/kernel/1205.3/02249.html

- The EDAC sysfs API will still report bogus values. So, userspace
tools like edac-utils will still use the bogus data;

- Add a new tracepoint-based way to get the binary information about
the errors.

This is the logging I currently see with layerscape EDAC. Need
something explaining these fields.

[ 407.612311] EDAC FSL_DDR MC0: Err Detect Register: 0x80000004 [
407.618182] EDAC FSL_DDR MC0: Faulty Data bit: 0
[ 407.622793] EDAC FSL_DDR MC0: Expected Data / ECC:
0x40c50901_40c50900 / 0x800000f0
[ 407.630443] EDAC FSL_DDR MC0: Captured Data / ECC: 0x40c50900_40c50901 / 0xf0
[ 407.637571] EDAC FSL_DDR MC0: Err addr: 0x3e0bfff50
[ 407.642440] EDAC FSL_DDR MC0: PFN: 0x003e0bff

This is the level of detail I need:

SYSTEM LOGGING
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac

The structure of the message is:
    the memory controller            (MC0)
    Error type                               (CE)
    memory page                         (0x283)
    offset in the page                   (0xce0)
    the byte granularity                (grain 8)
        or resolution of the error
    the error syndrome                 (0xb741)
    memory row                            (row 0)
    memory channel                     (channel 1)
    DIMM label, if set prior            (DIMM B1
    and then an optional, driver-specific message that may
            have additional information.

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.

On Mon, Nov 19, 2018 at 10:48 AM York Sun <york.sun@xxxxxxx> wrote:
>
> On 11/19/18 8:38 AM, Tracy Smith wrote:
> > Steve, you were correct, there wasn't a device tree entry for the
> > qoriq memory controller in
> > arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi.  I added it making it
> > identical to the fsl-ls1046s.dtsi, which should have the same memory
> > controller and entry as the ls1043a.  I added this but it didn't make
> > a difference as far as being able to call the probe function. I'm now
> > checking the mpc85xx_edac.c dtsi entry for comparison since York used
> > the mpc85xx as the basis for the layerscape, but there is something
> > else missing preventing the probe function from being called.
> >
> > @York
> > What is your entry for
> > /proc/device-tree/soc/ifc@1530000/board-control@1,0/compatible
>
> EDAC driver doesn't check IFC. Are you debugging EDAC for memory controller?
>
> >
> > @York
> > cat /proc/device-tree/compatible entry is this, is this correct?
> > fsl,ls1043a-rdbfsl,ls1043a
>
> Once again, you are using your modified code on your own board. So it is
> not ls1043ardb. This compatible has nothing to do with EDAC driver.
>
> I cannot help you with ls1043ardb because the real ls1043ardb board
> doesn't support ECC. The closest board I have is ls1046ardb.
>
> >
> >                 ddr: memory-controller@1080000 {
> >                          compatible = "fsl,qoriq-memory-controller";
> >                          reg = <0x0 0x1080000 0x0 0x1000>;
> >                          interrupts = <0 144 0x4>;
> >                          big-endian;
> >                  };
>
> This is your source code, not your final device tree. Please learn to
> use "fdt" command under U-Boot to dump your device tree before booting
> Linux, or check after Linux is up. For your reference, on my ls1046ardb,
> I have
>
> # cat /proc/device-tree/soc/memory-controller@1080000/compatible
> fsl,qoriq-memory-controller
>
> York



-- 
Confidentiality notice: This e-mail message, including any
attachments, may contain legally privileged and/or confidential
information. If you are not the intended recipient(s), please
immediately notify the sender and delete this e-mail message.



[Index of Archives]     [Netdev]     [Ethernet Bridging]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux