Rainer Fügenstein <rfu@xxxxxxxxxx> writes:

> Hi,
>
> my NAS-like server with 5*3TB SATA drives in RAID5 configuration was
> running without problems for what seems an eternity; since about 3
> weeks it keeps freezing every other day with the following error:
>
> # grep soft /var/log/messages
> Oct 15 11:26:49 alfred kernel: BUG: soft lockup - CPU#1 stuck for 60s! [md0_raid5:1614]
> Oct 15 11:26:49 alfred kernel: [<ffffffff8005e298>] call_softirq+0x1c/0x28
> Oct 15 11:26:49 alfred kernel: [<ffffffff80012583>] __do_softirq+0x51/0x133
> Oct 15 11:26:49 alfred kernel: [<ffffffff8005e298>] call_softirq+0x1c/0x28
> Oct 15 11:26:49 alfred kernel: [<ffffffff8006d63a>] do_softirq+0x2c/0x7d
> Oct 15 11:27:49 alfred kernel: BUG: soft lockup - CPU#1 stuck for 60s! [md0_raid5:1614]
> Oct 15 11:27:49 alfred kernel: [<ffffffff8005e298>] call_softirq+0x1c/0x28
> Oct 15 11:27:49 alfred kernel: [<ffffffff80012583>] __do_softirq+0x51/0x133
> Oct 15 11:27:49 alfred kernel: [<ffffffff8005e298>] call_softirq+0x1c/0x28
> Oct 15 11:27:49 alfred kernel: [<ffffffff8006d63a>] do_softirq+0x2c/0x7d
> Oct 15 11:28:49 alfred kernel: BUG: soft lockup - CPU#1 stuck for 60s! [md0_raid5:1614]
> Oct 15 11:28:49 alfred kernel: [<ffffffff8005e298>] call_softirq+0x1c/0x28
> Oct 15 11:28:49 alfred kernel: [<ffffffff80012583>] __do_softirq+0x51/0x133
> Oct 15 11:28:49 alfred kernel: [<ffffffff8005e298>] call_softirq+0x1c/0x28
> Oct 15 11:28:49 alfred kernel: [<ffffffff8006d63a>] do_softirq+0x2c/0x7d
> [...]
> this is only part of the story, check the end of this message for
> a detailed log.
>
> sometimes the server recovers after 60+ seconds, sometimes it requires
> a hard reset (causing mdraid to re-sync the whole array).

I strongly recommend adding a write-intent bitmap

   mdadm --grow /dev/md0 --bitmap=internal

With a bitmap, only the regions marked dirty need to be resynced after
a hard reset, which will speed up the resync enormously.

>
> IIRC, it started when a drive in the array failed with "SATA
> connection timeouts" (kind of). this drive has been replaced by a new
> one, but yet the CPU lockups keep coming.
>
> I suspect that aging hardware slowly starts to fail, but not sure
> which part (drives? SATA controller? cables? NIC? CPU? ...)
>
> here's some info that might be useful:
> # uname -a
> Linux alfred 2.6.18-406.el5 #1 SMP Tue Jun 2 17:25:57 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

This is a rather ancient kernel.  The "el" suffix probably suggests Red Hat?
If you have a Red Hat support contract you should ask them.  If you don't,
you should probably try a newer kernel (or buy a support contract).

NeilBrown
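
To see whether md0 already has a bitmap before running the mdadm command
above, something like the following might help. It is only a sketch: it
assumes the usual /proc/mdstat layout, where an array with a bitmap shows
a "bitmap:" line in its stanza, and md_has_bitmap is a made-up helper
name, not part of mdadm.

```shell
# Hypothetical helper: succeed (exit 0) if the named array's stanza in an
# mdstat-format file contains a "bitmap:" line, fail (exit 1) otherwise.
#   $1 = array name, e.g. md0
#   $2 = file to read (optional, defaults to /proc/mdstat)
md_has_bitmap() {
    awk -v dev="$1" '
        # A line like "md0 : active raid5 ..." starts a stanza; remember
        # whether it is the array we were asked about.
        $2 == ":"                 { in_dev = ($1 == dev); next }
        # Inside our stanza, a line starting with "bitmap:" means a
        # write-intent bitmap is present.
        in_dev && $1 == "bitmap:" { found = 1 }
        END                       { exit (found ? 0 : 1) }
    ' "${2:-/proc/mdstat}"
}
```

It could be used as

   md_has_bitmap md0 || echo "no bitmap on md0"

and only reads the file; adding the bitmap is still done by hand with
the mdadm --grow command.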