Is there nobody who can give me any additional information on this?

Executive summary: the machine freezes with the kernel dump below whenever
stripe_cache_size > 256. Please help if you can; running at 256 is killing
performance. (Rough sketches of how the array was built and how the hang can
be reproduced are appended at the end of this mail.)

On Thu, Nov 19, 2009 at 7:53 PM, Enigma <enigma@xxxxxxxxxxxxxxxxxx> wrote:
> I am in the process of migrating an 8x200 GB disk RAID 6 array to an
> 8x500 GB disk array. I created the array with 2 missing disks and added
> them after the array was started. The array synced fine at the default
> of 256 for /sys/block/md0/md/stripe_cache_size, but if I change it to a
> higher value, for example
> "echo 4096 > /sys/block/md0/md/stripe_cache_size", the system freezes up.
> The previous array ran fine with a cache size of 8192. The only
> difference between my old array and this one is that I increased the
> chunk size from 256 to 512. The machine is a dual Xeon with
> hyperthreading, 3 GB of main memory, kernel 2.6.29.1, mdadm v2.6.7.2.
> I let the array sync at the default cache size (with fairly poor
> performance), tested the synced array, and got the same behavior under
> load. Whenever the cache size is > 256 I get the following hang:
>
> [ 1453.847111] BUG: soft lockup - CPU#3 stuck for 61s! [md0_raid5:571]
> [ 1453.863456] Modules linked in: ipv6 dm_mod iTCO_wdt intel_rng
> rng_core pcspkr evdev i2c_i801 i2c_core e7xxx_edac edac_core
> parport_pc parport container
> [ 1453.919458]
> [ 1453.923455] Pid: 571, comm: md0_raid5 Not tainted (2.6.29.1-JJ #7) SE7501CW2
> [ 1453.943454] EIP: 0060:[<c033ec4e>] EFLAGS: 00000286 CPU: 3
> [ 1453.959453] EIP is at raid6_sse22_gen_syndrome+0x132/0x16c
> [ 1453.979454] EAX: dcca66c0 EBX: ffffffff ECX: 000006c0 EDX: dd1be000
> [ 1453.995452] ESI: f6005e60 EDI: f6005e5c EBP: 00000014 ESP: f6005e30
> [ 1454.015452] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [ 1454.031451] CR0: 80050033 CR2: b7ede195 CR3: 066e8000 CR4: 000006d0
> [ 1454.051451] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 1454.071450] DR6: ffff0ff0 DR7: 00000400
> [ 1454.083450] Call Trace:
> [ 1454.087450] [<c033adc1>] ? compute_parity6+0x201/0x26c
> [ 1454.103449] [<c033b7b2>] ? handle_stripe+0x6bc/0xad0
> [ 1454.119449] [<c015537c>] ? rcu_process_callbacks+0x33/0x39
> [ 1454.139449] [<c012a24e>] ? __do_softirq+0x7f/0x125
> [ 1454.151448] [<c033bf6f>] ? raid5d+0x3a9/0x3b7
> [ 1454.167448] [<c03d1b87>] ? schedule_timeout+0x13/0x86
> [ 1454.179447] [<c01176f5>] ? default_spin_lock_flags+0x5/0x8
> [ 1454.199447] [<c0347c76>] ? md_thread+0xb6/0xcc
> [ 1454.211446] [<c0135a11>] ? autoremove_wake_function+0x0/0x2d
> [ 1454.231446] [<c0347bc0>] ? md_thread+0x0/0xcc
> [ 1454.243446] [<c0135952>] ? kthread+0x38/0x5e
> [ 1454.255445] [<c013591a>] ? kthread+0x0/0x5e
> [ 1454.267445] [<c0103b93>] ? kernel_thread_helper+0x7/0x10
>
> In searching for a cause I have found a few other people who had issues
> like this, but they all seemed to be on an older kernel where the cause
> was a deadlock that should already be resolved in my version
> (e.g. http://marc.info/?l=linux-raid&m=116946415327616&w=2).
> Are there any known bugs in my kernel that would cause behavior like this?
> Here is some info about the array:
>
> # mdadm --examine /dev/sda2
> /dev/sda2:
>           Magic : a92b4efc
>         Version : 00.90.00
>            UUID : 65f266b7:852d5253:a847f9a3:2c253025
>   Creation Time : Thu Nov 19 01:57:33 2009
>      Raid Level : raid6
>   Used Dev Size : 401118720 (382.54 GiB 410.75 GB)
>      Array Size : 2406712320 (2295.22 GiB 2464.47 GB)
>    Raid Devices : 8
>   Total Devices : 8
> Preferred Minor : 0
>
>     Update Time : Thu Nov 19 19:40:26 2009
>           State : clean
>  Active Devices : 8
> Working Devices : 8
>  Failed Devices : 0
>   Spare Devices : 0
>        Checksum : 16b3ddef - correct
>          Events : 1150
>
>      Chunk Size : 512K
>
>       Number   Major   Minor   RaidDevice   State
> this     0       8        2        0        active sync   /dev/sda2
>
>    0     0       8        2        0        active sync   /dev/sda2
>    1     1       8       18        1        active sync   /dev/sdb2
>    2     2       8       34        2        active sync   /dev/sdc2
>    3     3       8       50        3        active sync   /dev/sdd2
>    4     4       8       66        4        active sync   /dev/sde2
>    5     5       8       98        5        active sync   /dev/sdg2
>    6     6       8       82        6        active sync   /dev/sdf2
>    7     7       8      114        7        active sync   /dev/sdh2
>
> # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
> md1 : active raid1 hdc1[1] hda1[0]
>       4200896 blocks [2/2] [UU]
>
> md0 : active raid6 sdh2[7] sdg2[5] sdf2[6] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
>       2406712320 blocks level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
>
> unused devices: <none>
>
> Can anyone point me at some information to debug this problem?
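For completeness, the migration procedure described above looks roughly like
the commands below. This is only a sketch: which two partitions started out as
the "missing" slots is a guess, and the data-copy step between creation and
the adds is omitted.

  # create the new 8-device RAID 6 with a 512K chunk and two slots left
  # empty; the array comes up degraded but usable
  mdadm --create /dev/md0 --level=6 --chunk=512 --raid-devices=8 \
      /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2 \
      missing missing

  # (copy the data over from the old array here)

  # then add the last two partitions so md rebuilds them into the empty slots
  mdadm --add /dev/md0 /dev/sdg2
  mdadm --add /dev/md0 /dev/sdh2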
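And in case it helps to narrow down where things start to go wrong, here is a
minimal sketch of stepping the cache size up under write load while watching
the kernel log. The size list, the one-minute soak per step, and the /mnt/md0
scratch path are placeholders, not part of the setup above. As I understand
it, the stripe cache pins stripe_cache_size pages per member device, so 4096
on an 8-disk array is only about 128 MB, which should be well within 3 GB of
RAM.

  #!/bin/sh
  # raise stripe_cache_size one step at a time and watch for soft lockups
  SYSFS=/sys/block/md0/md/stripe_cache_size

  for size in 384 512 768 1024 2048 4096; do
      echo "=== stripe_cache_size=$size ==="
      echo "$size" > "$SYSFS"

      # a streaming write as a stand-in for the real workload
      dd if=/dev/zero of=/mnt/md0/scratch.bin bs=1M count=4096 conv=fsync \
          > /dev/null 2>&1 &
      DD_PID=$!
      sleep 60
      kill "$DD_PID" 2> /dev/null
      wait "$DD_PID" 2> /dev/null

      # any soft-lockup warning should show up here (if the box survives)
      dmesg | tail -n 25
  done

Stopping at the first size that triggers the warning would at least show
whether 256 is a hard threshold or just the default that happens to survive.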