Re: Please help! AM35xx mm/slab.c BUG

CF Adad <cfadad@xxxxxxxxxxxxxx> · Tue, 5 Jun 2012 23:14:58 -0700 (PDT)

All,

We've learned a few more things:

1.) We have found a way to get it to happen pretty consistently.  We simply run iperf in a loop using the EMAC port to some other device.

2.) The crash ONLY happens on our custom board, not on the Twister dev kit.  This is true despite the fact that I ported our latest linux-omap 3.4-rc6 over there.  We're still running Technexion's default x-loader and u-boot to handle proper configs on that board. So, that's a substantial bit of code that is different between our boxes.  The kernel is altered only in that the few pinmux changes I left in Linux have been removed to avoid configuration differences between the two boards.
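
The iperf loop from item 1 can be sketched roughly as below. This is only an illustration of the reproduction, not our exact script: the function names are made up for the example, and it assumes the classic iperf client options `-c` (server address) and `-t` (run time in seconds).

```python
import subprocess


def build_iperf_cmd(server, seconds=30):
    """Return the iperf client command line for a single run."""
    return ["iperf", "-c", server, "-t", str(seconds)]


def run_loop(server, iterations, seconds=30, runner=subprocess.run):
    """Run iperf back to back and return how many runs completed.

    `runner` is injectable so the loop can be exercised without a network.
    """
    completed = 0
    for _ in range(iterations):
        result = runner(build_iperf_cmd(server, seconds))
        if result.returncode != 0:
            break  # stop at the first failed run (peer or link went away)
        completed += 1
    return completed
```

On the board this would be called as `run_loop("192.0.2.10", iterations=1000)`, where 192.0.2.10 is a placeholder for the peer running `iperf -s`.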

This suggests that either:
A) We have a hardware problem on our board.  Seems unlikely.  Can anyone think of anything hardware related that would manifest itself with these sorts of errors?

B) We have an issue in our bootloader code somewhere.  I hesitated to overwrite the bootloaders on the Twister baseboard for this test because I did not want to have to restore the pinmuxes and the like afterward.

Presuming something in those bootloaders is our problem, I wonder what EMAC-related stuff there really is.  For a long time we ran with our bootloaders NOT initializing either of the Eths.  This was Technexion's default.  They left that work to Linux.  We've recently done work to enable them in u-boot, but we were crashing like this long before that.  Once in Linux, we're just using the standard drivers and calls from within the board file to SMSC911x and the Davinci EMAC drivers.  I am using the patches that allow the e-fused MAC to be pulled from the AM35xx for the EMAC, but I can't see how that would cause this.

Assuming the EMAC is perhaps an innocent bystander that happens just to cause this, the place I would have to suspect the most in our bootloaders would be the GPMC settings.  We've done a good bit of tweaking in there since we switched chips.  *Could a GPMC timing issue account for these types of errors???*  The reason I bring it up is that the GPMC has been one of those things that we've really struggled to understand.  What should the timings *really* be?  We've done the best we can to try to guess our way through it.  BUT, we could certainly be very wrong.  If a GPMC setting could cause these types of bugs, please let me know.  I'll be happy to post more info on how we're setting that up now.  In case not, I'll save the electrons and not spam it here.
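
For reference when comparing timing values: GPMC timing fields are counted in GPMC functional-clock cycles, so a datasheet figure in nanoseconds has to be rounded up to whole cycles before it is programmed. Below is a minimal sketch of that conversion. The function names are hypothetical, and the 166 MHz clock used in the note afterward is an assumed example value, not something measured on this board.

```python
def gpmc_ns_to_ticks(time_ns, fclk_hz):
    """Round a duration in nanoseconds up to whole functional-clock cycles."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (time_ns * fclk_hz + 10**9 - 1) // 10**9


def gpmc_ticks_to_ns(ticks, fclk_hz):
    """Return the real duration, in nanoseconds, of a whole number of cycles."""
    return ticks * 1e9 / fclk_hz
```

With an assumed 166 MHz clock (about 6.02 ns per cycle), a 60 ns requirement becomes 10 cycles. If the bootloader assumes a faster clock than the hardware really runs, the programmed timings come out longer than intended; if it assumes a slower one, they come out shorter than the part requires.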

Thanks again for all your help!

PS -- If it's useful, here is our latest crash, with SLAB debugging enabled:

[ 5278.124023] slab: Internal list corruption detected in cache 'skbuff_head_cache'(20), slabp cecbb040(4). Tainted(Not tainted). Hex:
[ 5278.136840] 00000000: 00 01 10 00 00 02 20 00 b0 00 00 00 b0 b0 cb ce  ...... .........
[ 5278.145263] 00000010: 04 00 00 00 11 00 00 00 00 00 6b 6b 0f 00 00 00  ..........kk....
[ 5278.153686] 00000020: 03 00 00 00 0c 00 00 00 09 00 00 00 fe ff ff ff  ................
[ 5278.162078] 00000030: fd ff ff ff fd ff ff ff fd ff ff ff 10 00 00 00  ................
[ 5278.170501] 00000040: 02 00 00 00 13 00 00 00 00 00 00 00 ff ff ff ff  ................
[ 5278.178924] 00000050: 00 00 00 00 0b 00 00 00 0d 00 00 00 0a 00 00 00  ................
[ 5278.187316] 00000060: 12 00 00 00 0e 00 00 00 01 00 00 00              ............
[ 5278.195404] ------------[ cut here ]------------
[ 5278.200256] kernel BUG at mm/slab.c:3114!
[ 5278.204467] Internal error: Oops - BUG: 0 [#1] ARM
[ 5278.209503] Modules linked in:
[ 5278.212707] CPU: 0    Not tainted  (3.4.0-rc6 #2)
[ 5278.217681] PC is at check_slabp+0xe4/0xf4
[ 5278.222015] LR is at console_unlock+0x174/0x214
[ 5278.226776] pc : [<c00c3b08>]    lr : [<c002f8e0>]    psr: 80000093
[ 5278.226806] sp : cf83fc40  ip : 00000070  fp : cf83fc74
[ 5278.238861] r10: cecbb3b0  r9 : c04f91c0  r8 : cf812800
[ 5278.244354] r7 : 00000004  r6 : cecbb040  r5 : 00000014  r4 : c0486154
[ 5278.251220] r3 : c0508718  r2 : 20000093  r1 : 00000001  r0 : 0000005d
[ 5278.258117] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 5278.265716] Control: 10c5387d  Table: 8eda0019  DAC: 00000015
[ 5278.271759] Process iperf (pid: 1434, stack limit = 0xcf83e2f0)
[ 5278.277984] Stack: (0xcf83fc40 to 0xcf840000)
[ 5278.282562] fc40: 00000001 cecbb040 0000006c 00000001 cf83fca4 cecbb040 cf812800 cf813a00
[ 5278.291168] fc60: 00000005 cf816464 cf83fcbc cf83fc78 c00c4dbc c00c3a30 cf83fccc 00000000
[ 5278.299804] fc80: 00000010 00200200 00100100 00000000 cf83fce4 cf813a00 00000010 cf816440
[ 5278.308410] fca0: 00000000 cf812800 00000b90 cef89670 cf83fce4 cf83fcc0 c037ea2c c00c4cc0
[ 5278.317016] fcc0: cf81287c cf812800 cf816440 cef89678 c02ec07c 60000013 cf83fd0c cf83fce8
[ 5278.325653] fce0: c00c4b18 c037e998 cef89678 000005a8 000005a8 cedf981c cedf9500 00000000
[ 5278.334259] fd00: cf83fd24 cf83fd10 c02ec07c c00c4a3c cedf953c cef89678 cf83fd84 cf83fd28
[ 5278.342864] fd20: c0325ec0 c02ec034 cf83fd6c c00c3fd0 c03172d4 c03177a0 cfa52800 00000001
[ 5278.351470] fd40: cf83fecc 00000000 00000000 00001470 c05335c0 7fffffff cf83fd94 c0530610
[ 5278.360107] fd60: cf83fecc 00000000 cf621cf0 00000000 cf83fecc 00002000 cf83fdbc cf83fd88
[ 5278.368713] fd80: c0343aa4 c03258a8 00000000 00000000 cf83fd9c cfa74f40 d08b4d80 00000000
[ 5278.377319] fda0: 000006fe 00000000 00000000 00000000 cf83feb4 cf83fdc0 c02e3234 c0343a64
[ 5278.385955] fdc0: 00000000 cfa4ff40 cf83fe1c cf83fdd8 00000000 00002000 cf621cf0 cecbbbf8
[ 5278.394561] fde0: 00000000 cf83fecc cecb9d78 60000113 cfa52c80 cfa52800 cecb9d78 837fee5f
[ 5278.403167] fe00: 00000000 000005ea c05335c0 cecbbbf8 00000000 00000001 ffffffff 00000000
[ 5278.411773] fe20: 00000000 00000000 00000000 00000000 cedc60c0 cfa74f40 00000000 00000000
[ 5278.420410] fe40: c0287e54 c0286b0c cf83fdc8 00000000 d26d4d80 d08d0660 cfa52800 cf83e000
[ 5278.429016] fe60: cf83fe8c cf83fe70 c0287f2c c0288a98 00000001 c02e50d0 cf83fe94 cf83fe88
[ 5278.437622] fe80: c0026c64 c02e33dc cf83febc 00002000 cf621cf0 00000000 cf83fee8 00000000
[ 5278.446258] fea0: cf83e000 00082ee0 cf83ff8c cf83feb8 c02e508c c02e3188 00000001 fffffff7
[ 5278.454864] fec0: 00000001 00083a70 00001470 cf83fee8 00000080 cf83fec4 00000001 00000000
[ 5278.463470] fee0: 00000000 00000001 00000003 c0034d8c 00000100 00000000 00000003 00000010
[ 5278.472076] ff00: cf83ff54 cf83ff10 c0034d8c c0034428 00000044 03419fc0 00000000 0000000a
[ 5278.480712] ff20: c05468c0 00000100 c007632c cf83e000 00000044 c0035218 c050b338 cf83e000
[ 5278.489318] ff40: 00000044 00000000 cf83ff6c cf83ff58 c0035218 c0079544 c007633c c0522b6c
[ 5278.497924] ff60: cf83ff8c cf83ff70 00082ec8 00084ee8 00082ee0 00000123 c000e9c4 00000000
[ 5278.506561] ff80: cf83ffa4 cf83ff90 c02e5104 c02e5000 00000000 00000000 00000000 cf83ffa8
[ 5278.515167] ffa0: c000e780 c02e50e8 00082ec8 00084ee8 00000004 00082ee0 00002000 00000000
[ 5278.523773] ffc0: 00082ec8 00084ee8 00082ee0 00000123 0346bfc0 00000000 00002000 b5ce4f9c
[ 5278.532379] ffe0: 00000000 b5ce4d98 b6e90788 b6e91394 80000010 00000004 6b6b6b6b a56b6b6b
[ 5278.540985] Backtrace: 
[ 5278.543579] [<c00c3a24>] (check_slabp+0x0/0xf4) from [<c00c4dbc>] (free_block+0x108/0x20c)
[ 5278.552276]  r8:cf816464 r7:00000005 r6:cf813a00 r5:cf812800 r4:cecbb040
[ 5278.559387] [<c00c4cb4>] (free_block+0x0/0x20c) from [<c037ea2c>] (cache_flusharray+0xa0/0xfc)
[ 5278.568420] [<c037e98c>] (cache_flusharray+0x0/0xfc) from [<c00c4b18>] (kmem_cache_free+0xe8/0xf0)
[ 5278.577850]  r8:60000013 r7:c02ec07c r6:cef89678 r5:cf816440 r4:cf812800
[ 5278.584747] r3:cf81287c
[ 5278.587524] [<c00c4a30>] (kmem_cache_free+0x0/0xf0) from [<c02ec07c>] (__kfree_skb+0x54/0xcc)
[ 5278.596496] [<c02ec028>] (__kfree_skb+0x0/0xcc) from [<c0325ec0>] (tcp_recvmsg+0x624/0x864)
[ 5278.605285]  r4:cef89678 r3:cedf953c
[ 5278.609069] [<c032589c>] (tcp_recvmsg+0x0/0x864) from [<c0343aa4>] (inet_recvmsg+0x4c/0x60)
[ 5278.617858] [<c0343a58>] (inet_recvmsg+0x0/0x60) from [<c02e3234>] (sock_recvmsg+0xb8/0xd8)
[ 5278.626617]  r6:00000000 r5:00000000 r4:00000000
[ 5278.631500] [<c02e317c>] (sock_recvmsg+0x0/0xd8) from [<c02e508c>] (sys_recvfrom+0x98/0xe8)
[ 5278.640289] [<c02e4ff4>] (sys_recvfrom+0x0/0xe8) from [<c02e5104>] (sys_recv+0x28/0x30)
[ 5278.648712] [<c02e50dc>] (sys_recv+0x0/0x30) from [<c000e780>] (ret_fast_syscall+0x0/0x30)
[ 5278.657409] Code: e58d3008 e3a03010 e59f100c eb04f0a3 (e7f001f2) 
[ 5278.668273] ---[ end trace 018554de1af4a1fa ]---
[ 5300.147521] slab: Internal list corruption detected in cache 'skbuff_head_cache'(20), slabp cee4a000(12). Tainted(Tainted: G      :
[ 5300.161437] 00000000: 00 50 d8 ce 00 3a 81 cf 70 00 00 00 70 a0 e4 ce  .P...:..p...p...
[ 5300.169860] 00000010: 0c 00 00 00 07 00 00 00 00 00 6b 6b fd ff ff ff  ..........kk....
[ 5300.178283] 00000020: 05 00 00 00 fd ff ff ff fd ff ff ff fd ff ff ff  ................
[ 5300.186676] 00000030: 06 00 00 00 0a 00 00 00 fd ff ff ff 01 00 00 00  ................
[ 5300.195098] 00000040: fd ff ff ff ff ff ff ff 08 00 00 00 fd ff ff ff  ................
[ 5300.203521] 00000050: fd ff ff ff fd ff ff ff fd ff ff ff fd ff ff ff  ................
[ 5300.211914] 00000060: fd ff ff ff fd ff ff ff fd ff ff ff              ............
[ 5300.220001] ------------[ cut here ]------------
[ 5300.224853] kernel BUG at mm/slab.c:3114!
[ 5300.229064] Internal error: Oops - BUG: 0 [#2] ARM
[ 5300.234100] Modules linked in:
[ 5300.237304] CPU: 0    Tainted: G      D       (3.4.0-rc6 #2)
[ 5300.243286] PC is at check_slabp+0xe4/0xf4
[ 5300.247589] LR is at console_unlock+0x174/0x214
[ 5300.252349] pc : [<c00c3b08>]    lr : [<c002f8e0>]    psr: 80000193
[ 5300.252380] sp : c04efc98  ip : 00000070  fp : c04efccc
[ 5300.264434] r10: cf812800  r9 : fffffffe  r8 : cf812800
[ 5300.269927] r7 : 0000000c  r6 : cee4a000  r5 : 00000014  r4 : c0486154
[ 5300.276763] r3 : c0508718  r2 : 20000193  r1 : 00000001  r0 : 0000005d
[ 5300.283630] Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[ 5300.291412] Control: 10c5387d  Table: 8eda0019  DAC: 00000015
[ 5300.297454] Process swapper (pid: 0, stack limit = 0xc04ee2f0)
[ 5300.303588] Stack: (0xc04efc98 to 0xc04f0000)
[ 5300.308166] fc80:                                                       00000001 cee4a000
[ 5300.316741] fca0: 0000006c 00000001 00000009 cee4a000 00000004 0000000c cecb905c cf816440
[ 5300.325347] fcc0: c04efd24 c04efcd0 c037e398 c00c3a30 c04efd74 fb0000e0 00000000 cf813a00
[ 5300.333953] fce0: 00000020 00000020 00000000 00000020 00200200 00100100 c0317ac8 cf812800
[ 5300.342559] fd00: 60000113 00000020 c02eb484 00000020 c05335c0 00000000 c04efd54 c04efd28
[ 5300.351165] fd20: c00c4784 c037e244 cfa52800 00000008 cfa52800 00000020 cf812800 00000634
[ 5300.359741] fd40: c02eba04 00000000 c04efd7c c04efd58 c02eb484 c00c463c cfa52800 cecb3978
[ 5300.368347] fd60: 8385d0e8 00000000 00000073 cecb3978 c04efd94 c04efd80 c02eba04 c02eb454
[ 5300.376953] fd80: cfa52c80 cfa52800 c04efdac c04efd98 c0285c74 c02eb9e4 019caec5 cfa52800
[ 5300.385559] fda0: c04efdd4 c04efdb0 c0286b74 c0285c58 c0017b30 c0548030 cfa4ff40 cfa4ff40
[ 5300.394165] fdc0: 60000113 cfa74f40 c04efdfc c04efdd8 c0287e54 c0286b0c cfa4ff40 00000000
[ 5300.402740] fde0: d26d4260 d08d0660 cfa52800 c04ee000 c04efe1c c04efe00 c0287f2c c0287db0
[ 5300.411346] fe00: 00000000 cfa74f40 00000040 00000040 c04efe3c c04efe20 c0288a98 c0287e6c
[ 5300.419952] fe20: 00000001 cfa52c8c 00000001 00000001 c04efe64 c04efe40 c0287048 c0288a58
[ 5300.428558] fe40: c0286fac cfa52c8c 00000001 00000040 0000012c c0509978 c04efe9c c04efe68
[ 5300.437164] fe60: c02f5694 c0286fb8 00000001 0009c410 c04efebc 00000001 00000003 0000000c
[ 5300.445739] fe80: c05468d0 c05468cc 411fc087 c04ee000 c04efee4 c04efea0 c0034d1c c02f55f0
[ 5300.454345] fea0: 00000044 c04fa1c0 411fc087 0000000a c05468c0 00000100 c007632c c04ee000
[ 5300.462951] fec0: 00000044 00000000 00000044 c04fa1c0 411fc087 00000000 c04efefc c04efee8
[ 5300.471557] fee0: c0035220 c0034c78 c007633c c0522b6c c04eff1c c04eff00 c000f0d0 c00351a0
[ 5300.480163] ff00: 00000044 fa200000 c04eff40 c0535aa0 c04eff3c c04eff20 c00085cc c000f098
[ 5300.488769] ff20: c000f448 20000013 ffffffff c04eff74 c04effac c04eff40 c000e3c0 c0008564
[ 5300.497344] ff40: 00000000 00000000 00000000 00000001 c04ee000 c04ee000 c05351c8 c04ee000
[ 5300.505950] ff60: c04fa1c0 411fc087 00000000 c04effac c04eff30 c04eff88 c00794c8 c000f448
[ 5300.514556] ff80: 20000013 ffffffff 00000000 c04f6f68 c0535140 00000000 c078b140 80004059
[ 5300.523162] ffa0: c04effbc c04effb0 c03753b4 c000f40c c04efff4 c04effc0 c04b179c c0375358
[ 5300.531768] ffc0: 00000000 00000000 c04b12e0 00000000 00000000 c04d5194 10c5387d c04f608c
[ 5300.540344] ffe0: c04d5190 c04fa1b4 00000000 c04efff8 80008040 c04b155c 00000000 00000000
[ 5300.548950] Backtrace: 
[ 5300.551544] [<c00c3a24>] (check_slabp+0x0/0xf4) from [<c037e398>] (cache_alloc_refill+0x160/0x754)
[ 5300.560974]  r8:cf816440 r7:cecb905c r6:0000000c r5:00000004 r4:cee4a000
[ 5300.568054] [<c037e238>] (cache_alloc_refill+0x0/0x754) from [<c00c4784>] (kmem_cache_alloc+0x154/0x164)
[ 5300.578033] [<c00c4630>] (kmem_cache_alloc+0x0/0x164) from [<c02eb484>] (__alloc_skb+0x3c/0xfc)
[ 5300.587188] [<c02eb448>] (__alloc_skb+0x0/0xfc) from [<c02eba04>] (__netdev_alloc_skb+0x2c/0x54)
[ 5300.596435] [<c02eb9d8>] (__netdev_alloc_skb+0x0/0x54) from [<c0285c74>] (emac_rx_alloc+0x28/0x64)
[ 5300.605865]  r4:cfa52800 r3:cfa52c80
[ 5300.609619] [<c0285c4c>] (emac_rx_alloc+0x0/0x64) from [<c0286b74>] (emac_rx_handler+0x74/0x11c)
[ 5300.618865]  r4:cfa52800 r3:019caec5
[ 5300.622619] [<c0286b00>] (emac_rx_handler+0x0/0x11c) from [<c0287e54>] (__cpdma_chan_free+0xb0/0xbc)
[ 5300.632232]  r6:cfa74f40 r5:60000113 r4:cfa4ff40
[ 5300.637115] [<c0287da4>] (__cpdma_chan_free+0x0/0xbc) from [<c0287f2c>] (__cpdma_chan_process+0xcc/0x104)
[ 5300.647186] [<c0287e60>] (__cpdma_chan_process+0x0/0x104) from [<c0288a98>] (cpdma_chan_process+0x4c/0x64)
[ 5300.657318]  r7:00000040 r6:00000040 r5:cfa74f40 r4:00000000
[ 5300.663299] [<c0288a4c>] (cpdma_chan_process+0x0/0x64) from [<c0287048>] (emac_poll+0x9c/0x208)
[ 5300.672424]  r6:00000001 r5:00000001 r4:cfa52c8c r3:00000001
[ 5300.678405] [<c0286fac>] (emac_poll+0x0/0x208) from [<c02f5694>] (net_rx_action+0xb0/0x1a8)
[ 5300.687194]  r8:c0509978 r7:0000012c r6:00000040 r5:00000001 r4:cfa52c8c
[ 5300.694061] r3:c0286fac
[ 5300.696838] [<c02f55e4>] (net_rx_action+0x0/0x1a8) from [<c0034d1c>] (__do_softirq+0xb0/0x1d8)
[ 5300.705902] [<c0034c6c>] (__do_softirq+0x0/0x1d8) from [<c0035220>] (irq_exit+0x8c/0x94)
[ 5300.714416] [<c0035194>] (irq_exit+0x0/0x94) from [<c000f0d0>] (handle_IRQ+0x44/0x94)
[ 5300.722656]  r4:c0522b6c r3:c007633c
[ 5300.726409] [<c000f08c>] (handle_IRQ+0x0/0x94) from [<c00085cc>] (omap3_intc_handle_irq+0x74/0x84)
[ 5300.735839]  r6:c0535aa0 r5:c04eff40 r4:fa200000 r3:00000044
[ 5300.741821] [<c0008558>] (omap3_intc_handle_irq+0x0/0x84) from [<c000e3c0>] (__irq_svc+0x40/0x60)
[ 5300.751129] Exception stack(0xc04eff40 to 0xc04eff88)
[ 5300.756439] ff40: 00000000 00000000 00000000 00000001 c04ee000 c04ee000 c05351c8 c04ee000
[ 5300.765045] ff60: c04fa1c0 411fc087 00000000 c04effac c04eff30 c04eff88 c00794c8 c000f448
[ 5300.773651] ff80: 20000013 ffffffff
[ 5300.777313]  r7:c04eff74 r6:ffffffff r5:20000013 r4:c000f448
[ 5300.783294] [<c000f400>] (cpu_idle+0x0/0xb8) from [<c03753b4>] (rest_init+0x68/0x80)
[ 5300.791412]  r8:80004059 r7:c078b140 r6:00000000 r5:c0535140 r4:c04f6f68
[ 5300.798309] r3:00000000
[ 5300.801055] [<c037534c>] (rest_init+0x0/0x80) from [<c04b179c>] (start_kernel+0x24c/0x290)
[ 5300.809753] [<c04b1550>] (start_kernel+0x0/0x290) from [<80008040>] (0x80008040)
[ 5300.817535] Code: e58d3008 e3a03010 e59f100c eb04f0a3 (e7f001f2) 
[ 5300.824005] ---[ end trace 018554de1af4a1fb ]---
[ 5300.828887] Kernel panic - not syncing: Fatal exception in interrupt
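
The two hex dumps above can be re-checked offline. The sketch below is a rough illustration, not the kernel's code. It assumes the dumped bytes are a little-endian, 32-bit `struct slab` header (two list pointers, `colouroff`, `s_mem`, `inuse`, `free`, then a 16-bit `nodeid` plus two pad bytes, 28 bytes in total) followed by one 32-bit free-list index per object, with 0xffffffff ending the chain. That layout is an assumption based on the 3.4-era mm/slab.c and should be verified against the actual tree. The helper names are hypothetical.

```python
import re
import struct

BUFCTL_END = 0xFFFFFFFF     # assumed end-of-chain marker
HEADER_FMT = "<IIIIIIH2x"   # assumed header layout, 28 bytes (see lead-in)

# First dump from the log, with timestamps and the ASCII column stripped.
DUMP_1 = """
00000000: 00 01 10 00 00 02 20 00 b0 00 00 00 b0 b0 cb ce
00000010: 04 00 00 00 11 00 00 00 00 00 6b 6b 0f 00 00 00
00000020: 03 00 00 00 0c 00 00 00 09 00 00 00 fe ff ff ff
00000030: fd ff ff ff fd ff ff ff fd ff ff ff 10 00 00 00
00000040: 02 00 00 00 13 00 00 00 00 00 00 00 ff ff ff ff
00000050: 00 00 00 00 0b 00 00 00 0d 00 00 00 0a 00 00 00
00000060: 12 00 00 00 0e 00 00 00 01 00 00 00
"""


def parse_hexdump(text):
    """Collect the bytes from 'offset: xx xx ...' lines, ignoring the rest."""
    data = bytearray()
    pattern = r"\b[0-9a-f]{8}: ((?:[0-9a-f]{2} ?){1,16})"
    for match in re.finditer(pattern, text):
        data += bytes.fromhex(match.group(1))
    return bytes(data)


def decode_slab(raw):
    """Split the raw bytes into header fields and the per-object index array."""
    header_size = struct.calcsize(HEADER_FMT)
    fields = struct.unpack_from(HEADER_FMT, raw)
    count = (len(raw) - header_size) // 4
    bufctl = list(struct.unpack_from("<%dI" % count, raw, header_size))
    names = ("next", "prev", "colouroff", "s_mem", "inuse", "free", "nodeid")
    slab = dict(zip(names, fields))
    slab["bufctl"] = bufctl
    return slab


def walk_free_list(slab):
    """Follow the free chain; return (indices visited, chain looks sane)."""
    bufctl = slab["bufctl"]
    num = len(bufctl)
    chain = []
    index = slab["free"]
    while index != BUFCTL_END:
        chain.append(index)
        if len(chain) > num or index >= num:
            return chain, False  # chain too long or index out of range
        index = bufctl[index]
    return chain, len(chain) == num - slab["inuse"]


chain, ok = walk_free_list(decode_slab(parse_hexdump(DUMP_1)))
print(ok, chain[:8])
```

Under those assumptions, the first dump's chain starts 17, 18, 14, 11, 0, 15, 13 and then returns to index 0, so it never reaches the end marker. In the second dump the `free` head is 7, and entry 7 holds 0xfffffffd rather than a next index.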

----- Original Message -----
From: CF Adad <cfadad@xxxxxxxxxxxxxx>
To: Tony Lindgren <tony@xxxxxxxxxxx>
Cc: "linux-omap@xxxxxxxxxxxxxxx" <linux-omap@xxxxxxxxxxxxxxx>
Sent: Tuesday, June 5, 2012 12:29 PM
Subject: Re: Please help!  AM35xx mm/slab.c BUG

Hi Tony,

Thanks so much for the response!  All good suggestions.

#1.) Missing retention/off idle workarounds
I'm highly suspicious of this one.  I've seen a lot of patches addressing things in this category come out recently for the Sitara series, and we've tried to incorporate everything we've seen.  We also rebased our tree off the linux-omap master as recently as May 17th.  As I mentioned in the first post, I hope to do this again soon, perhaps today even, to pull in all the good work you folks have done bringing us up to the RCs of 3.5.

Since we discovered the "nohlt" option, we've added it to our default kernel command line and have been running with it.  For a while, I thought maybe that had fixed the glitch, but then yesterday came along...  That crash from the first message occurred with 'nohlt' enabled.

#2.) Broken Memory
We really hammered this one as well, as TechNexion delivered our boards with 256MB of NANYA NT5TU64M16GG-AC RAM.  Since we were unfamiliar with that part, we rolled up our sleeves and evaluated every timing and configuration parameter in x-loader using the EMIF4 settings calculator spreadsheet provided by TI.  We have also been running cycles of "memtester 200M" calls, and the board seems to hold up fine under that with both the default, very conservative timings and the more optimized ones we determined with the TI sheet.

I'll give your suggestion of limiting the memory a shot and see if that makes a difference.  Several of our older captures were run with SLAB_DEBUG set, but it seemed at the time that we weren't getting any more info out of that so we disabled it.  I'll re-enable.

#3.) Software bugs
We're certainly not opposed to the idea that we're doing something wrong.  :)  In fact, that would almost seem likely at this point.

A few other things that may be helpful:

* Could these issues be related to our GPMC?
We're using the SMSC LAN9221 on our board, not the slower LAN9220 that it seems all the AM35xx dev. kits are using.  Frankly, the fastest we could get with that chip was ~40Mbps with a ~1-2% packet loss.  :-(  So, we stepped up to the faster LAN9221 that's used by Gumstix and several others on the OMAP series.  It's running super-well right now (> 80Mbps with 0% loss) with the faster GPMC timings and configuration provided with the Gumstix source.  Is there perhaps a reason all the AM35xx boards were using the LAN9220 instead?  We assumed the AM35xx GPMC was essentially as capable as the OMAP's.  Was that a faulty assumption?

Speaking of GPMC, our NAND that Technexion is delivering requires a 4-bit ECC.  As support for that seems spotty at the moment in the various bootloader and kernel configurations, we finally punted and simply used Micron's on-die engine to do it.  It appears stable, and we've done various filesystem burn-in tests to stress it.  A little while back we also rigged a combination nandtest + iperf across the SMSC to really stress the GPMC.  This too ran fine for several iterations.

* DaVinci EMAC?
Perhaps it's just my latest thought-of-the-day, but since I saw so many of these things yesterday while focusing on Ethernet work, after seeing none for the past several days doing other work, I can't help but think it may be related to the networks somehow.  Some of our TAM3517's do not have the SMSC hooked up to them.  They are just using their EMAC adapters, but they have exhibited these SLAB crashes too.  So, maybe it's the EMAC?

We've noticed that when we run bandwidth tests between a pair of EMACs using iperf, we get a pretty reduced data rate, maybe 60Mbps.  There is also the occasional dropped packet.  When we connect an EMAC to another port, say a laptop or a Gumstix SMSC, we get blazing performance.  That seems very odd.  It's like the driver is more than capable of producing those high-class speeds, but when two of them get together they agree to dog it.  Could this maybe be related???

Thanks again for your time and help!

----- Original Message -----
From: Tony Lindgren <tony@xxxxxxxxxxx>
To: CF Adad <cfadad@xxxxxxxxxxxxxx>
Cc: "linux-omap@xxxxxxxxxxxxxxx" <linux-omap@xxxxxxxxxxxxxxx>
Sent: Tuesday, June 5, 2012 3:08 AM
Subject: Re: Please help!  AM35xx mm/slab.c BUG

* CF Adad <cfadad@xxxxxxxxxxxxxx> [120604 23:47]:
> All,
> 
> I'm **really** hoping someone out there can help us with this.
> 
> My team has been working with the AM3517 for several months now, and we seem to be plagued every so often by what we have termed the "slab bug".  In short, it looks something like the pasted bootlog below.  This has been an *incredibly* hard bug to figure out.  We have a couple of different AM3517-based platforms at our disposal, but the one we see the issue on almost exclusively is a custom, prototype baseboard designed around the TechNexion TAM3517.  Over the last several months, we have tried several versions of Linux off the linux-omap tree, with loads of different configurations, and even different bootloader versions and combinations.  We've spent most of our time with a linux-omap snapshot that was a 3.2-rc6, and more recently a 3.4-rc6 from a week or two back.  (Tomorrow I anticipate pulling the latest 3.5 now that I see it's out.)  In all cases, since we switched to 3.0+, we've seen these errors.
> 
> They are *very* inconsistent in when they occur, but they happen often enough to be very frustrating.  Consequently, our team has had an incredibly difficult time tracking down what's causing them.  They seem to occur at random, perhaps on average once every handful of days.  We've messed with everything we can think of from tweaking kernel options (like enabling/disabling preemption), to disabling various drivers and userspace components, to reviewing every single line in any of our board files.  We have tried different versions and combinations of the OS and both bootloaders (x-loader & u-boot), and even went so far as to do a full analysis of the RAM timings in the EMIF4.  Unfortunately, nothing so far has worked.  The error occurs when operating off both the SD/MMC and the NAND devices, with or without the Ethernets (LAN9221 & EMAC) up and/or running, with or without PREEMPT, under heavy load and sometimes just idling, ...  There is simply nothing consistent about it.  After probably 2 weeks without seeing one, I saw 3 today.
> 
> Though the error's occurrence is inconsistent, the error itself is consistent.  It always throws an internal Oops at the following section of code in mm/slab.c:
> ---
> /*
> * The slab was either on partial or free list so
> * there must be at least one object available for
> * allocation.
> */
> BUG_ON(slabp->inuse >= cachep->num);
> ---
> (It appears this was patched in eons ago: https://lkml.org/lkml/2007/2/19/20.  So it's nothing new.)

I can think of at least three issues causing errors like this:

1. Missing retention/off idle workarounds

   You can test this one by booting with nohlt cmdline option and
   seeing if that helps.

2. Broken memory

   I've seen at least one case of this where things would work
   fine if only half of the memory was in use and devices would
   oops at random point within a week. To test for this you can
   pass cmdline options to artificially partition the memory and
   leave out some chunks to see if that helps. Or boot with
   mem=xxxM set to half of the physical memory. And run your tests
   with SLAB_DEBUG set.

3. Software bugs

   My experience is that things are behaving very reliably regarding
   cache and highmem, so I would check #1 and #2 first.

Regards,

Tony 