Re: Please help! AM35xx mm/slab.c BUG

CF Adad <cfadad@xxxxxxxxxxxxxx> · Thu, 14 Jun 2012 21:23:25 -0700 (PDT)

Hi Jean-Philippe,

Thanks for the notes. I agree wholeheartedly with the idea that we're having a hardware issue. I'm just trying to eliminate any other software-related issues if I can. Since I'm really just a software engineer, I have to leave the hardware work to the hardware folks. All I can do is try to make sure the software is as rock solid as I can. Frankly, however, I think we're running out of places to look in software, provided you folks do not know of any outstanding issues similar to ours in the core AM35xx support at the moment.

The *only* thing that seems to stand in contrast to the notion of this being a hardware problem is that we have *finally* managed to see a few of these crashes on our development kit. Since that kit contains none of our hardware, that was pretty surprising. That platform, however, is certainly _exponentially_ more stable than ours at the moment, which does support the idea of this being mostly or perhaps completely a hardware issue.

Our design is using a COM developed by another vendor and a custom baseboard. As noted in another post, our power has been "glitchy" since our first prototype baseboards arrived. Our board uses a standard, PC-grade power supply that we selected from a very reputable brand due to its very low stated minimum loads. Obviously AM35xx processors, even with a load of attachments, do not demand much relative to a standard PC. We expected that power to be clean due to the reputation of the supply manufacturer and the fact we met the minimum specs. So we basically wired its 5V and 3.3V rails directly into our board and the AM3517-based COM with only customary decoupling, nothing special. The trouble has been that those vendor specs appear to be a bit "dreamy" to say the least. We've measured a number of variances in the supply, and have had to install some loading resistors on the rails for now just to ease an otherwise very nasty ripple. So we do have all that going on and expect that could be the core of this. More recently, we also started detecting high frequency transients that seem to show up on all the rails at 20 - 30 minute intervals or during heavy usage. These signals are in the 80 - 100MHz range, and exist for only a few pulses. Though the board appears to keep trucking when they occur, there's no way those can be good. We've yet to determine where those are coming from. So, all those things are being looked at very closely right now.

To answer your questions:  We do have a few boards in the hands of a few engineers, and all of them are seeing similar stability and performance issues, again very sporadically. Since they appear for a while and then disappear for a much longer while, it's been incredibly hard to characterize in any sense of the word. Regarding adding capacitance or other filtering, our hardware engineer is looking at that right now. As far as the voltages are concerned, I don't believe we have a lot of control over the 5V, 3.3V, etc. as they are basically just sourced directly from the supply.

Regarding the EMAC:  Has anyone else got a pair of AM3517-based somethings lying around that they may have run iperf between?  If directly connected with either a crossover cable or a decent 100Mbps switch, can anyone get the full 85+Mbps one would expect with good TCP on a 100Mbps link?  As noted in the other post, the best we can do between our EMACs is roughly 50 - 70Mbps. If we connect that same EMAC, through the same switch, to a non-EMAC NIC, we fly right up to the full 85+Mbps mark and stay there. It's incredibly odd... I'd be very interested in knowing what others are seeing.
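
For anyone comparing their own iperf numbers, here is a rough sketch of where the "85+Mbps" expectation comes from. It is only back-of-the-envelope arithmetic, assuming a standard 1500-byte MTU, full duplex, no VLAN tags, and TCP timestamps enabled (12 bytes of options). The function names are just illustrative and are not part of iperf or any driver.

```python
# Rough TCP goodput ceiling on a 100 Mbps Ethernet link.
# Assumptions (not measured on our boards): 1500-byte MTU, full duplex,
# no VLAN tag, TCP timestamps enabled.

LINK_RATE_MBPS = 100.0
MTU = 1500                       # bytes of IP packet per frame
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble/SFD + header + FCS + inter-frame gap
IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMPS = 12              # TCP timestamp option, padded


def tcp_goodput_mbps(link_rate_mbps=LINK_RATE_MBPS, mtu=MTU,
                     tcp_options=TCP_TIMESTAMPS):
    """Best-case TCP payload rate for full-sized frames sent back to back."""
    payload = mtu - IP_HEADER - TCP_HEADER - tcp_options
    on_wire = mtu + ETH_OVERHEAD
    return link_rate_mbps * payload / on_wire


def fraction_of_ceiling(measured_mbps, **kwargs):
    """How close a measured iperf result is to the best-case figure."""
    return measured_mbps / tcp_goodput_mbps(**kwargs)


if __name__ == "__main__":
    print("ceiling: %.1f Mbps" % tcp_goodput_mbps())
    for measured in (50.0, 70.0, 85.0):
        print("%.0f Mbps is %.0f%% of the ceiling"
              % (measured, 100.0 * fraction_of_ceiling(measured)))
```

Under those assumptions the ceiling works out to roughly 94 Mbps, so 85+Mbps is about 90% of what the wire can carry, while our EMAC <=> EMAC results of 50 - 70Mbps are only somewhere between a half and three quarters of it.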

Thanks again for your comments!

----- Original Message -----
From: jean-philippe francois <jp.francois@xxxxxxxxxx>
To: CF Adad <cfadad@xxxxxxxxxxxxxx>
Cc: "Mohammed, Afzal" <afzal@xxxxxx>; "linux-omap@xxxxxxxxxxxxxxx" <linux-omap@xxxxxxxxxxxxxxx>; Tony Lindgren <tony@xxxxxxxxxxx>; "Shilimkar, Santosh" <santosh.shilimkar@xxxxxx>
Sent: Thursday, June 14, 2012 3:10 PM
Subject: Re: Please help! AM35xx mm/slab.c BUG

2012/6/14 CF Adad <cfadad@xxxxxxxxxxxxxx>:
> An update:
>
> *LAN9221 and GPMC off the hook?*
>
> We've isolated the GPMC away from this, I believe, by disabling the LAN9221 in both our bootloaders and the kernel and by booting everything off the SD cards only.  By removing its initialization code from the respective board files, I *hope* we've basically removed it from contention.  Obviously the chip is still wired up, but I don't expect the bootloaders or kernel to be trying to talk to it.  Likewise, the NAND is being initialized, but we're not mounting or using it at all.
>
> With these changes we're still seeing these crashes, albeit with the same incredible lack of frequency.
>
> *EMAC now _partially_ on the hook?*
> I posted a separate thread on what I think may be a related subject, potential DaVinci EMAC problems, here:  http://www.spinics.net/lists/linux-omap/msg71833.html.
>
> As you can see from the crashes posted there, there seems to be a bit of whining from the EMAC driver.  Since performance in the EMAC <=> EMAC case has always been questionable anyway, any chance there is a tiny memory leak or something similar that could be contributing?
>
> What about configuring this EMAC from within u-boot?  Could that initialization do something bad when we get into Linux?  I've not touched these drivers.  I've simply called them like other boards in the family are doing.
>
> Just this morning, I upgraded to the latest linux-omap 3.5-rc2, but still saw one of these crashes pretty quickly...
>
> *Power stability?*
>
> We're learning through all of this that our boards do appear to have some funny transients running through the power circuits every so often.  The ones we've captured on the scope have not caused crashes or hard lockups, but they are there.  This could be a dumb question, but could power issues create a slab error like this?  I guess I'm more accustomed to seeing power issues result in hard lockups rather than a nicely worded dump with the kernel sometimes still somewhat functioning.
>
>
I am following this bug with interest, because we often go the "custom
hardware" way, and have faced situations like these.
In my opinion, random memory corruption is more often than not the sign of a
hardware design issue. EMAC here is perhaps only a symptom, because
it provides the right memory bandwidth and power consumption pattern to
trigger the glitch.

How many boards do you have? Are some more stable than others?
Can you solder additional caps on top of your power decoupling caps?
Can you tweak the voltages?

> Can anyone suggest to me anything I may not have tried to get more information out of these crashes when they occur?
>
> Thanks again to all!
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html