Hi Tony,

Thanks so much for the response! All good suggestions.

#1.) Missing retention/off idle workarounds

I'm highly suspicious of this one. I've seen a lot of patches addressing things in this category come out recently for the Sitara series, and we've tried to incorporate everything we've seen. We also rebased our tree off the linux-omap master as recently as May 17th. As I mentioned in the first post, I hope to do this again soon, perhaps even today, to pull in all the good work you folks have done bringing us up to the RCs of 3.5.

Since we discovered the "nohlt" option, we've added it to our default kernel command line and have been running with it. For a while, I thought maybe that had fixed the glitch, but then yesterday came along... The crash from the first message occurred with "nohlt" enabled.

#2.) Broken memory

We really hammered this one as well, as TechNexion delivered our boards with 256MB of NANYA NT5TU64M16GG-AC RAM. Since we were unfamiliar with that part, we rolled up our sleeves and evaluated every timing and configuration parameter in x-loader using the EMIF4 settings calculator spreadsheet provided by TI. We have also been running cycles of "memtester 200M" calls, and the board seems to hold up fine under that with both the default, very conservative timings and the more optimized ones we determined with the TI sheet.

I'll give your suggestion of limiting the memory a shot and see if that makes a difference. Several of our older captures were run with SLAB_DEBUG set, but it seemed at the time that we weren't getting any more info out of that, so we disabled it. I'll re-enable it.

#3.) Software bugs

We're certainly not opposed to the idea that we're doing something wrong. :) In fact, that would almost seem likely at this point.

A few other things that may be helpful:

* Could these issues be related to our GPMC? We're using the SMSC LAN9221 on our board, not the slower LAN9220 that it seems all the AM35xx dev. kits are using.
Frankly, the fastest we could get with that chip was ~40Mbps with ~1-2% packet loss. :-( So, we stepped up to the faster LAN9221 that's used by Gumstix and several others on the OMAP series. It's running super-well right now (> 80Mbps with 0% loss) with the faster GPMC timings and configuration provided with the Gumstix source. Is there perhaps a reason all the AM35xx boards were using the LAN9220 instead? We assumed the AM35xx GPMC was essentially as capable as the OMAP's. Was that a faulty assumption?

Speaking of GPMC, the NAND that TechNexion is delivering requires 4-bit ECC. As support for that seems spotty at the moment in the various bootloader and kernel configurations, we finally punted and simply used Micron's on-die engine to do it. It appears stable, and we've done various filesystem burn-in tests to stress it. A little while back we also rigged up a combined nandtest + iperf run across the SMSC to really stress the GPMC. This too ran fine for several iterations.

* DaVinci EMAC? Perhaps it's just my latest thought-of-the-day, but since I saw so many of these crashes yesterday while focusing on Ethernet work, after seeing none for the past several days doing other work, I can't help but think they may be related to the network somehow. Some of our TAM3517s do not have the SMSC hooked up to them. They are just using their EMAC adapters, but they have exhibited these SLAB crashes too. So, maybe it's the EMAC?

We've noticed that when we run bandwidth tests between a pair of EMACs using iperf, we get a pretty reduced data rate, maybe 60Mbps. There is also the occasional dropped packet. When we connect an EMAC to another port, say a laptop or a Gumstix SMSC, we get blazing performance. That seems very odd. It's like the driver is more than capable of producing those high-class speeds, but when two of them get together they agree to dog it. Could this maybe be related?

Thanks again for your time and help!

----- Original Message -----
From: Tony Lindgren <tony@xxxxxxxxxxx>
To: CF Adad <cfadad@xxxxxxxxxxxxxx>
Cc: "linux-omap@xxxxxxxxxxxxxxx" <linux-omap@xxxxxxxxxxxxxxx>
Sent: Tuesday, June 5, 2012 3:08 AM
Subject: Re: Please help! AM35xx mm/slab.c BUG

* CF Adad <cfadad@xxxxxxxxxxxxxx> [120604 23:47]:
> All,
>
> I'm **really** hoping someone out there can help us with this.
>
> My team has been working with the AM3517 for several months now, and we seem to be plagued every so often by what we have termed the "slab bug". In short, it looks something like the pasted bootlog below. This has been an *incredibly* hard bug to figure out. We have a couple of different AM3517-based platforms at our disposal, but the one we see the issue on almost exclusively is a custom, prototype baseboard designed around the TechNexion TAM3517. Over the last several months, we have tried several versions of Linux off the linux-omap tree, with loads of different configurations, and even different bootloader versions and combinations. We've spent most of our time with a linux-omap snapshot that was a 3.2-rc6, and more recently a 3.4-rc6 from a week or two back. (Tomorrow I anticipate pulling the latest 3.5 now that I see it's out.) In all cases, since we switched to 3.0+, we've seen these errors.
>
> They are *very* inconsistent in when they occur, but they happen often enough to be very frustrating. Consequently, our team has had an incredibly difficult time tracking what's causing them. They seem to occur at random, perhaps on average once every handful of days. We've messed with everything we can think of, from tweaking kernel options (like enabling/disabling preemption), to disabling various drivers and userspace components, to reviewing every single line in any of our board files. We have tried different versions and combinations of the OS and both bootloaders (x-loader & u-boot), and even went so far as to do a full analysis of the RAM timings in the EMIF4.
> Unfortunately, nothing so far has worked. The error occurs when operating off both the SD/MMC and the NAND devices, with or without the Ethernets (LAN9221 & EMAC) up and/or running, with or without PREEMPT, under heavy load and sometimes just idling, ... There is simply nothing consistent about it. After probably 2 weeks without seeing one, I saw 3 today.
>
> Though the error's occurrence is inconsistent, the error itself is not. It always throws an internal Oops at the following section of code in mm/slab.c:
> ---
> /*
>  * The slab was either on partial or free list so
>  * there must be at least one object available for
>  * allocation.
>  */
> BUG_ON(slabp->inuse >= cachep->num);
> ---
> (It appears this was patched in eons ago: https://lkml.org/lkml/2007/2/19/20. So it's nothing new.)

I can think of at least three issues causing errors like this:

1. Missing retention/off idle workarounds

   You can test this one by booting with the nohlt cmdline option and seeing if that helps.

2. Broken memory

   I've seen at least one case of this where things would work fine if only half of the memory was in use, and devices would oops at a random point within a week. To test for this you can pass cmdline options to artificially partition the memory and leave out some chunks to see if that helps. Or boot with mem=xxxM set to half of the physical memory. And run your tests with SLAB_DEBUG set.

3. Software bugs

   My experience is that things are behaving very reliably regarding cache and highmem, so I would check #1 and #2 first.

Regards,

Tony
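
For reference, the condition behind the quoted BUG_ON can be modelled in userspace. What follows is a minimal sketch under stated assumptions, not the kernel's code: only the field names inuse and num come from the mm/slab.c excerpt above, while the struct names and the helpers model_alloc() and model_free() are hypothetical and exist purely for illustration. It shows the invariant being checked: a slab taken from the partial or free list must still have at least one unused object, so inuse must be strictly less than num.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins; NOT the kernel's real structures. */
struct model_cache {
    unsigned int num;    /* objects per slab (cachep->num in the excerpt) */
};

struct model_slab {
    unsigned int inuse;  /* objects currently handed out (slabp->inuse) */
};

/*
 * Take one object from a slab that is supposed to be on the partial or
 * free list. Returns 0 on success, or -1 in the situation where the
 * kernel check BUG_ON(slabp->inuse >= cachep->num) would fire.
 */
static int model_alloc(struct model_slab *slabp,
                       const struct model_cache *cachep)
{
    if (slabp->inuse >= cachep->num)
        return -1;       /* bookkeeping says "full", yet we were asked to allocate */
    slabp->inuse++;
    return 0;
}

/* Give one object back to the slab. */
static void model_free(struct model_slab *slabp)
{
    assert(slabp->inuse > 0);
    slabp->inuse--;
}
```

In this model, a slab with num = 4 accepts four calls to model_alloc() and rejects the fifth. The same rejection happens if inuse is overwritten with a bogus value while the slab is still treated as partial, which is why a stray write or a flipped bit in RAM can surface at this particular check even when the allocator logic itself is sound.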