Re: Linux software RAID becomes unresponsive after removing a disk from server

NeilBrown <neilb@xxxxxxxx> · Wed, 21 Dec 2016 16:12:40 +1100

On Sat, Dec 17 2016, PHP-Friends GmbH wrote:

> Hello everyone,
>
> first of all: This is in fact a crossposting from serverfault 
> (http://serverfault.com/questions/821195/linux-software-raid-becomes-unresponsive-after-removing-a-disk-from-server), 
> as the user shodanshok recommended contacting this mailing list because 
> to him this seems like a possible bug in the Linux RAID software. I want 
> to add that I can provide more logs and information if they are needed, 
> but as the text is already quite long I thought that would be enough for 
> the moment.
>
> I am running a CentOS 7 machine (standard kernel: 
> 3.10.0-327.36.3.el7.x86_64) with a software RAID-10 over 16x 1 TB SSDs 
> (to be more precise, there are two RAID arrays on the disks; one of the 
> arrays is providing the host's swap partition). Last week, an SSD failed:
>
...

> 11:48:00 kvm7 kernel: INFO: task md3_raid10:1293 blocked for more than 
> 120 seconds.
> 11:48:00 kvm7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> 11:48:00 kvm7 kernel: md3_raid10      D ffff883f26e55c00     0 1293      
> 2 0x00000000
> 11:48:00 kvm7 kernel: ffff887f24bd7c58 0000000000000046 ffff887f212eb980 
> ffff887f24bd7fd8
> 11:48:00 kvm7 kernel: ffff887f24bd7fd8 ffff887f24bd7fd8 ffff887f212eb980 
> ffff887f23514400
> 11:48:00 kvm7 kernel: ffff887f235144dc 0000000000000001 ffff887f23514500 
> ffff8807fa4c4300
> 11:48:00 kvm7 kernel: Call Trace:
> 11:48:00 kvm7 kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
> 11:48:00 kvm7 kernel: [<ffffffffa0104ef7>] freeze_array+0xb7/0x180 [raid10]

This might be a known bug, possibly the one fixed by
  commit ccfc7bf1f09d ("raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang")

I have no idea which patches are included in your CentOS kernel.
In general, we only provide support for mainline kernels here.  That is
not because we don't want to support others, but because digging around
inside a non-mainline kernel is much more work.
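
One rough way to look for the fix on the distribution side is to search
the kernel package changelog for words from the commit subject.  This is
only a sketch: the function name find_fix_mentions is made up for
illustration, the search terms are taken from the commit subject above,
and a backport may be worded differently, so an empty result proves
nothing.

```shell
#!/bin/sh
# Sketch only: filter a kernel package changelog (read from stdin) for
# entries that mention the freeze_array fix.
# find_fix_mentions is a hypothetical helper name; the search terms are
# taken from the upstream commit subject and may not match a reworded
# backport.
find_fix_mentions() {
    grep -i -E 'freeze_array|bio_end_io_list'
}

# Possible use on an RPM-based system (assumes the installed package is
# named kernel-<release>):
#   rpm -q --changelog "kernel-$(uname -r)" | find_fix_mentions
```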

BTW, to remove a device from an md array after it has been physically
removed from the system, you can use
  mdadm /dev/mdXX --remove detached
That wouldn't have helped you here, but it is worth knowing for future
reference.
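
As a minimal sketch, the removal could be wrapped in a small script.
/dev/mdXX is still a placeholder for the real array, and the names run,
remove_detached and DRY_RUN are invented for this example.  The extra
"--fail detached" step is not from the text above; it comes from the
mdadm man page, which allows the word "detached" with --fail as well, and
it may be unnecessary if md has already marked the member as faulty.  The
script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: drop members whose device has vanished from an md array.
# ARRAY is a placeholder; set it to the real array device.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
ARRAY="${ARRAY:-/dev/mdXX}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

remove_detached() {
    # Mark members that are no longer connected as faulty (assumption:
    # harmless if md has already done so) ...
    run mdadm "$1" --fail detached
    # ... then remove them from the array.
    run mdadm "$1" --remove detached
}

remove_detached "$ARRAY"
```

Running it with ARRAY=/dev/md3 DRY_RUN=0 as root would execute the two
mdadm commands against that array; it is worth checking /proc/mdstat
before and after.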

NeilBrown