Re: BUG: soft lockup - CPU#0 stuck for 10s [md2_raid1]

On 26/12/2009, at 7:23 PM, Steven Haigh wrote:

> Hi again,
> 
> I have another system that eventually hangs when doing a resync on a software RAID1.
> 
> The system is another CentOS 5.4 install with a fairly vanilla config... The message is:
> 
> BUG: soft lockup - CPU#0 stuck for 10s! [md2_raid1:358]
> 
> Pid: 358, comm:            md2_raid1
> EIP: 0060:[<c04ec5dd>] CPU: 0
> EIP is at memcmp+0x12/0x22
> EFLAGS: 00000246    Not tainted  (2.6.18-164.6.1.el5 #1)
> EAX: 00000000 EBX: e4fc7606 ECX: e4caf606 EDX: 00000000
> ESI: 000009fa EDI: 00000054 EBP: e578b740 DS: 007b ES: 007b
> CR0: 8005003b CR2: 0806af70 CR3: 30d7c000 CR4: 000006d0
> [<f8843c64>] raid1d+0x270/0xbea [raid1]
> [<c0616db8>] schedule+0x9cc/0xa55
> [<c061747b>] schedule_timeout+0x13/0x8c
> [<c05a7029>] md_thread+0xdf/0xf5
> [<c0434c17>] autoremove_wake_function+0x0/0x2d
> [<c05a6f4a>] md_thread+0x0/0xf5
> [<c0434b55>] kthread+0xc0/0xeb
> [<c0434a95>] kthread+0x0/0xeb
> [<c0405c53>] kernel_thread_helper+0x7/0x10
> =======================
> 
> I have tried this with both kernel 2.6.18-164.6.1.el5 and 2.6.18-164.9.1.el5, with the same results.
> 
> md0/1/3 all check without causing any CPU locks.
> 
> # cat /proc/mdstat 
> Personalities : [raid1] 
> md0 : active raid1 hdc1[1] hda1[0]
>      521984 blocks [2/2] [UU]
> 
> md1 : active raid1 hdc2[1] hda2[0]
>      10482304 blocks [2/2] [UU]
> 
> md3 : active raid1 hdc4[1] hda4[0]
>      1052160 blocks [2/2] [UU]
> 
> md2 : active raid1 hdc3[1] hda3[0]
>      300511808 blocks [2/2] [UU]
>      [>....................]  resync =  2.7% (8395136/300511808) finish=208.3min speed=23370K/sec
> 
> unused devices: <none>
> 
> # mdadm -Q --detail /dev/md2
> /dev/md2:
>        Version : 0.90
>  Creation Time : Mon Feb 23 17:15:41 2009
>     Raid Level : raid1
>     Array Size : 300511808 (286.59 GiB 307.72 GB)
>  Used Dev Size : 300511808 (286.59 GiB 307.72 GB)
>   Raid Devices : 2
>  Total Devices : 2
> Preferred Minor : 2
>    Persistence : Superblock is persistent
> 
>    Update Time : Sat Dec 26 19:21:36 2009
>          State : active, resyncing
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 0
>  Spare Devices : 0
> 
> Rebuild Status : 3% complete
> 
>           UUID : fed99e3d:d08fdcc9:b9593a45:2cc09736
>         Events : 0.30587
> 
>    Number   Major   Minor   RaidDevice State
>       0       3        3        0      active sync   /dev/hda3
>       1      22        3        1      active sync   /dev/hdc3
> 
> Interestingly, this is the same box that randomly reports an ext3 bad block on /dev/md2 and remounts the filesystem read-only, which I posted about a few hours ago.

Argh. Hit the send button too early and missed this bit...

A while ago there was some discussion that this may have been fixed in 2.6.32, but no specific patch was ever identified as the fix. Has there been any progress on this? Would it be possible for the RH guys to merge such a patch (if/when it is identified) into a RHEL kernel?
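
Looking at the trace, the md thread appears stuck in memcmp() called from raid1d, which fits the check/resync path comparing the two mirrors page by page. As a rough sketch of what I suspect is going on (illustrative only, not the actual 2.6.18 drivers/md/raid1.c; compare_mirrors() and handle_mismatch() are made-up names), the loop would need a voluntary reschedule to keep the soft-lockup watchdog quiet:

#include <linux/mm.h>      /* page_address(), PAGE_SIZE */
#include <linux/sched.h>   /* cond_resched() */
#include <linux/string.h>  /* memcmp() */

/*
 * Sketch only -- not the real raid1 code. A check/resync pass reads
 * both mirrors and compares them page by page; on a ~300GB array that
 * is a very long-running loop, and without a voluntary reschedule the
 * md thread can hold CPU0 past the 10s soft-lockup threshold.
 */
static void compare_mirrors(struct page **primary,
                            struct page **secondary, int npages)
{
        int i;

        for (i = 0; i < npages; i++) {
                if (memcmp(page_address(primary[i]),
                           page_address(secondary[i]), PAGE_SIZE))
                        handle_mismatch(i);     /* hypothetical helper */
                cond_resched(); /* yield so the watchdog can be kicked */
        }
}

If the eventual upstream fix really is just a cond_resched() in that loop, it should be a trivial candidate for a RHEL backport.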
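
For anyone wanting to reproduce the "md0/1/3 all check cleanly" observation above, a manual check can be kicked off per-array via sysfs, e.g.:

# echo check > /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/mismatch_cnt

Here md0, md1 and md3 all complete; only md2 (the big array) trips the lockup.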

--
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



