Hi,

We are facing a slab corruption issue while using the e1000e net driver in certain situations. The problem is very hard to reproduce, and although we know what is corrupted, we cannot figure out why it is corrupted. If you could throw some light on it, that would be great.

Kindly note that this is a legacy system which is actively used in enterprise backup applications, with several units in the field. It runs a 2.6.23 kernel (I agree it is ancient, but we still need to support it).

The panic occurs like this:

[ 5865198.540571] ------------[ cut here ]------------
[ 5865198.605172] kernel BUG at mm/slab.c:2983!
[ 5865198.662501] invalid opcode: 0000 [1] SMP
[ 5865198.737975] CPU 15
[ 5865198.772608] Modules linked in: msr ipmi_watchdog 8021q mptctl dd_iosched(P) uhci_hcd ehci_hcd i2c_i801 i2c_core ipmi_si ipmi_devintf ipmi_msghandler usbcore dd
[ 5865199.132810] Pid: 5504, comm: ddfs Tainted: P M 2.6.23-ddr272789 #1
[ 5865199.221161] RIP: 0010:[<ffffffff8028996b>]  [<ffffffff8028996b>] cache_alloc_refill+0x1bb/0x230
[ 5865199.334792] RSP: 0000:ffffc20006215cb0  EFLAGS: 00010046
[ 5865199.407682] RAX: 000000000000000f RBX: 0000000000000027 RCX: 0000000000000028
[ 5865199.502352] RDX: ffff8110182abd40 RSI: ffff81020b75e000 RDI: ffff8110182afcc0
[ 5865199.597019] RBP: ffff8103368d3000 R08: 0000000000000020 R09: ffff81018bc1b820
[ 5865199.691685] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[ 5865199.786357] R13: ffff81101830d000 R14: ffff8110182abd40 R15: ffff8110182afcc0
[ 5865199.881150] FS: 0000000041ccc960(0063) GS:ffff811018269ec0(0000) knlGS:0000000000000000
[ 5865199.987328] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5865200.065439] CR2: 00002aaaaaaec000 CR3: 00000007919d1000 CR4: 00000000000006e0
[ 5865200.160170] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 5865200.254939] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5865200.349703] Process ddfs (pid: 5504, threadinfo ffff810217072000, task ffff8102e43d6000)
[ 5865200.455880] Stack: 0000000000000000 ffff8110182abd50 0000002000000000 ffff81100127e000
[ 5865200.562070]        0000000000000000 0000000000000286 0000000000000020 ffff8110182afcc0
[ 5865200.660782]        0000000000000000 ffffffff8028956d ffff81100127e000 00000000ffffffff
[ 5865200.757206] Call Trace:
[ 5865200.798266]  <IRQ>
[ 5865200.831762]  [<ffffffff8028956d>] kmem_cache_alloc_node+0xfd/0x110
[ 5865200.915023]  [<ffffffff8044817c>] __alloc_skb+0x4c/0x160
[ 5865200.987911]  [<ffffffff804482bb>] __netdev_alloc_skb+0x2b/0x50
[ 5865201.067036]  [<ffffffff880f8746>] :e1000e:e1000_alloc_rx_buffers+0x1d6/0x250
[ 5865201.160729]  [<ffffffff880f904f>] :e1000e:e1000_clean_rx_irq+0x2ff/0x320
[ 5865201.250308]  [<ffffffff880fb089>] :e1000e:e1000_poll+0x79/0x2e0
[ 5865201.330490]  [<ffffffff8810c6b5>] :e1000e:__kc_adapter_clean+0x35/0x60
[ 5865201.417960]  [<ffffffff80450516>] net_rx_action+0x96/0x180
[ 5865201.492967]  [<ffffffff8023804b>] __do_softirq+0x8b/0x120
[ 5865201.566893]  [<ffffffff8020d0ac>] call_softirq+0x1c/0x30
[ 5865201.639780]  [<ffffffff8020e6ea>] do_softirq+0x4a/0xb0
[ 5865201.710594]  [<ffffffff8020e5c2>] do_IRQ+0xc2/0x1a0
[ 5865201.778297]  [<ffffffff8020c3c1>] ret_from_intr+0x0/0xa

The stack trace is always like this. There is one more thread working on the same skbuff_head_cache as the task above:
[exception RIP: _spin_lock+7]
RIP: ffffffff804de8c7  RSP: ffff8105cffa5af0  RFLAGS: 00000082
RAX: ffff81101830ac00  RBX: ffff8101f82b86c0  RCX: 0000000000000000
RDX: ffff81101830ac00  RSI: 0000000000000020  RDI: ffff8110182abd80
RBP: 0000000000000000   R8: 0000000000000080   R9: 0000000000000080
R10: 0000000000000000  R11: 0000000000000000  R12: 000000000000003c
R13: ffff81101830ac00  R14: ffff8110182abd40  R15: ffff8110182afcc0
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
[ffffffff80725000] _spin_lock at ffffffff804de8c7
[ffff8105cffa5af0] cache_alloc_refill at ffffffff80289839
[ffff8105cffa5b40] kmem_cache_alloc_node at ffffffff8028956d
[ffff8105cffa5b70] __alloc_skb at ffffffff8044817c
[ffff8105cffa5bb0] tcp_send_ack at ffffffff804977d6
[ffff8105cffa5bc0] tcp_rcv_established at ffffffff80492d2b
[ffff8105cffa5c00] tcp_v4_do_rcv at ffffffff8049ab05
[ffff8105cffa5c08] autoremove_wake_function at ffffffff80248110
[ffff8105cffa5c60] tcp_prequeue_process at ffffffff80489e5b
[ffff8105cffa5c80] tcp_recvmsg at ffffffff8048a7f6
[ffff8105cffa5d10] sock_common_recvmsg at ffffffff804475a0
[ffff8105cffa5d30] sock_aio_read at ffffffff80442f3f
[ffff8105cffa5dd0] do_sync_read at ffffffff8029076b
[ffff8105cffa5e40] autoremove_wake_function at ffffffff80248110

Tried to debug the core file as below:

ffff8110182afb40 skbuff_fclone_cache      448       5      1072    134     4k
kmem: skbuff_head_cache: partial list: slab: ffff8103368d3000 bad inuse counter: 15
ffff8110182afcc0 skbuff_head_cache        256      14        15      1     4k

crash> slab ffff8103368d3000
struct slab {
  list = {
    next = 0xffff810322925040,
    prev = 0xffff8110182abd40   ---> this is actually the kmem_list3 (l3 cache) object of
                                     skbuff_head_cache; need to check why it is assigned
                                     as the previous slab pointer
  },
  colouroff = 128,
  s_mem = 0xffff8103368d3080,
  inuse = 15,            ------------> This should be 14
  free = 4294967295,     ----> This should be 1. Since it is an unsigned value it can
                               never go below 0, so it appears to have wrapped around.
                               Looks like a double free has happened.
  nodeid = 0
}

crash> kmem_list3 ffff8110182abd40
struct kmem_list3 {
  slabs_partial = {
    next = 0xffff8103368d3000,   ---> referred to above as the slab with the bad inuse
    prev = 0xffff810291f4f000
  },
  slabs_full = {
    next = 0xffff81020b75e000,
    prev = 0xffff810342085000
  },
  slabs_free = {
    next = 0xffff8110182abd60,
    prev = 0xffff8110182abd60
  },
  free_objects = 3165,
  free_limit = 1035,
  colour_next = 1,
  list_lock = {

The slab->inuse counter did not get decremented when a slab object was freed, corrupting the slab list. All the other nodes' prev and next pointers are fine; only this insertion is wrong. So the prev pointer assignment for the ffff8103368d3000 node has definitely gone wrong. Tried to search for references to the corrupt address:

crash> rd -s ffff810000725dc0 20
ffff810000725dc0:  ffff8110182abd40 ffff81101830ac00
ffff810000725dd0:  000000000000003c 0000000000000000
crash> rd -s ffff810000725f60 50
ffff810000725f60:  ffff8110182abd40 ffff81101830ac00
ffff810000725f70:  000000000000003c 0000000000000000

Found that ffff8110182abd40 is the kmem_list3 (L3 cache) object of node 0 present in skbuff_head_cache:

crash> kmem_cache 0xffff8110182afcc0
struct kmem_cache {
  array = {0xffff81101830ac00, 0xffff81101830a800, 0xffff81101830a400, 0xffff81101830a000,
           0xffff81101830bc00, 0xffff81101830b800, 0xffff81101830b400, 0xffff81101830b000,
           0xffff81101830cc00, 0xffff81101830c800, 0xffff81101830c400, 0xffff81101830c000,
           0xffff81101830dc00, 0xffff81101830d800, 0xffff81101830d400, 0xffff81101830d000,
           0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  ...
  nodelists = {0xffff8110182abd40, 0x0, 0x0, 0xffff8110182b0c00, 0xffff8110182b0800,

ffff81101830ac00 is present in the following CPU register map:

START: crash_nmi_callback at ffffffff8021c236
[ffffffff80725eb0] crash_nmi_callback at ffffffff8021c236
[ffffffff80725ec0] notifier_call_chain at ffffffff804e195f
[ffffffff80725ef0] notify_die at ffffffff8024babd
[ffffffff80725f20] default_do_nmi at ffffffff804dfd61
[ffffffff80725f40] do_nmi at ffffffff804e0357
[ffffffff80725f50] nmi at ffffffff804df5ef
[exception RIP: _spin_lock+7]
RIP: ffffffff804de8c7  RSP: ffff8105cffa5af0  RFLAGS: 00000082
RAX: ffff81101830ac00  RBX: ffff8101f82b86c0  RCX: 0000000000000000
RDX: ffff81101830ac00  RSI: 0000000000000020  RDI: ffff8110182abd80
RBP: 0000000000000000   R8: 0000000000000080   R9: 0000000000000080
R10: 0000000000000000  R11: 0000000000000000  R12: 000000000000003c
R13: ffff81101830ac00  R14: ffff8110182abd40  R15: ffff8110182afcc0
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
[ffffffff80725000] _spin_lock at ffffffff804de8c7
[ffff8105cffa5af0] cache_alloc_refill at ffffffff80289839
[ffff8105cffa5b40] kmem_cache_alloc_node at ffffffff8028956d
[ffff8105cffa5b70] __alloc_skb at ffffffff8044817c
[ffff8105cffa5bb0] tcp_send_ack at ffffffff804977d6
[ffff8105cffa5bc0] tcp_rcv_established at ffffffff80492d2b
[ffff8105cffa5c00] tcp_v4_do_rcv at ffffffff8049ab05
[ffff8105cffa5c08] autoremove_wake_function at ffffffff80248110
[ffff8105cffa5c60] tcp_prequeue_process at ffffffff80489e5b
[ffff8105cffa5c80] tcp_recvmsg at ffffffff8048a7f6

Here ffff8110182afcc0 is skbuff_head_cache, whereas ffff81101830ac00 is the first entry of its per-CPU array (array[0], i.e. the array_cache for CPU 0).
crash> array_cache 0xffff81101830ac00
struct array_cache {
  avail = 0,            ----> array cache is empty
  limit = 120,
  batchcount = 60,
  touched = 1,
  lock = {
    raw_lock = {
      slock = 1
    },
    tracker = {
      caller = 0,
      owner = 0
    }
  },
  entry = 0xffff81101830ac20
}

crash> kmem -S skbuff_head_cache
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
kmem: skbuff_head_cache: partial list: slab: ffff8103368d3000 bad inuse counter: 15
ffff8110182afcc0 skbuff_head_cache        256         14        15      1     4k
kmem: skbuff_head_cache: partial list: slab: ffff8103368d3000 bad inuse counter: 15
SLAB              MEMORY            TOTAL  ALLOCATED  FREE
ffff81020b75e000  ffff81020b75e080     15         14     1
FREE / [ALLOCATED]
  [ffff81020b75e080]
  [ffff81020b75e180]
  [ffff81020b75e280]
  [ffff81020b75e380]
  [ffff81020b75e480]
  [ffff81020b75e580]
   ffff81020b75e680  (cpu 15 cache)   ----> this is the freed object
  [ffff81020b75e780]
  [ffff81020b75e880]
  [ffff81020b75e980]
  [ffff81020b75ea80]
  [ffff81020b75eb80]
  [ffff81020b75ec80]
  [ffff81020b75ed80]
  [ffff81020b75ee80]

I see this pattern in all the references to the corrupt value, but I am not sure how the two are linked. It looks like even when the slab on slabs_partial became full, it was not moved to slabs_full, which made the other thread assume there was still a free object in this slab, and it panics at:

	/*
	 * The slab was either on partial or free list so
	 * there must be at least one object available for
	 * allocation.
	 */
	BUG_ON(slabp->inuse >= cachep->num);

There is a lot of activity on the e1000e driver:

1263: 3567059092  PCI-MSI-edge  eth5   ----> e1000e interrupt count

This is on CPU 15, where the slab corruption happened. The load on the other CPUs is more or less equal; irqbalance is distributing it. We have enabled slab debugging and have been running for a few days, but as I mentioned, the issue is hard to reproduce. Is there any way to figure out when and how the insertion of the slab got corrupted? Can you throw some light on this issue?
Thanks