Re: Raid5 device hangs in active state

Larkin Lowrey <llowrey@xxxxxxxxxxxxxxxxx> · Tue, 28 Feb 2012 15:33:56 -0600

Thank you for taking a look.

I should be able to move the drives to a completely different controller
(and driver) so that will be a good test.

Could NCQ be an issue? IOW, do you think it would be worth disabling NCQ
and re-running this scenario?
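
For reference, a minimal sketch of how NCQ could be switched off at runtime for such a test, assuming the disks are libata-attached and expose a writable queue_depth attribute in sysfs. Setting the depth to 1 effectively disables NCQ for that disk. The function name, the SYSFS_ROOT override, and the device names in the usage note are placeholders for illustration, not anything taken from this system.

```shell
#!/bin/sh
# Sketch only: set the queue depth of one disk via sysfs.
# A depth of 1 effectively disables NCQ on libata-attached disks.
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a scratch directory; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

set_queue_depth() {
    # $1 = device name (placeholder, e.g. sda), $2 = desired depth
    qd="$SYSFS_ROOT/block/$1/device/queue_depth"
    if [ -w "$qd" ]; then
        echo "$2" > "$qd"
    else
        echo "cannot write $qd" >&2
        return 1
    fi
}
```

Run as root, something like `for d in sda sdb sdc; do set_queue_depth "$d" 1; done` would cover each array member (the names are examples only). The setting does not survive a reboot; booting with libata.force=noncq is the persistent alternative.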

--Larkin

On 2/28/2012 1:52 PM, NeilBrown wrote:
> On Tue, 28 Feb 2012 12:21:39 -0600 Larkin Lowrey <llowrey@xxxxxxxxxxxxxxxxx>
> wrote:
>
>> I did another sysrq dump and have attached the output.
>
> Thanks. Unfortunately it contains nothing of value - too much has been
> lost. It seems that 'Show State' contains a lot more noise than it used to.
>
> You will need to boot with
> log_buf_len=4M
>
> or something like that.
>
>>
>> Again, 'iostat -dx 1' showed 100% utilization on the LVM which uses
>> /dev/md0 as a pv and /sys/block/md0/md/stripe_cache_active was 29 and
>> that value did not change. There were no error messages in
>> /var/log/messages or 'dmesg'.
>
> The '29' could simply mean that md/raid5 has sent 29 requests down to lower
> levels which have not yet completed.
>>
>> My suspicions lie with md0 since the stripe_cache_active value remains
>> at a fixed non-zero value even though all disks are (or appear to be)
>> idle. Should I be looking elsewhere? This hardware did not exhibit this
>> problem before "upgrading" from Fedora 15 to Fedora 16.
>
> My guess is a problem with one of the drive controllers. Your monthly 'sync'
> puts a much heavier load on them than normal IO does. It is consistently
> sending a bunch of requests to all devices at exactly the same time. This
> could trigger race conditions that normal IO does not.
>
> But that is just a guess. Unfortunately it is very hard to track exactly
> what is going wrong in this sort of case.
>
> I'd suggest shuffling devices so they are on different controllers, or maybe
> replace a controller. See if you can get the problem to move, and then see
> which controller it stayed with.
>
> NeilBrown
>
>
>>
>> Thank you,
>>
>> --Larkin
>>
>> On 1/8/2012 6:26 PM, NeilBrown wrote:
>>> On Sun, 08 Jan 2012 16:03:10 -0600 Larkin Lowrey <llowrey@xxxxxxxxxxxxxxxxx>
>>> wrote:
>>>
>>>> Suggestions?
>>>
>>> # echo t > /proc/sysrq-trigger
>>>
>>> and capture the messages that go to 'dmesg'. Post them.
>>>
>>> Hopefully your message ring buffer is big enough to collect the entire
>>> output. If it isn't you might need to boot with
>>> log_buf_len=1M
>>> or similar.
>>>
>>> That should show what process is blocking on what.
>>>
>>> NeilBrown
>>
>
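
The observation quoted above, that /sys/block/md0/md/stripe_cache_active sat at a fixed non-zero value, can be checked repeatably with a small sampling loop. This is a rough sketch under stated assumptions: the function name, the SYSFS_ROOT override, and the sample count and interval are all hypothetical choices. As noted in the thread, a constant value may only mean that requests sent to lower levels have not completed, so a positive result is a hint rather than proof of an md problem.

```shell
#!/bin/sh
# Sketch only: sample stripe_cache_active several times and report
# whether it stayed at the same non-zero value throughout.
# SYSFS_ROOT is a hypothetical override for trying this against a
# scratch directory; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

stripe_cache_stuck() {
    # $1 = md device (e.g. md0), $2 = number of samples,
    # $3 = seconds to wait between samples
    f="$SYSFS_ROOT/block/$1/md/stripe_cache_active"
    first=$(cat "$f") || return 2
    [ "$first" -gt 0 ] || return 1          # zero: nothing outstanding
    i=1
    while [ "$i" -lt "$2" ]; do
        sleep "$3"
        cur=$(cat "$f") || return 2
        [ "$cur" = "$first" ] || return 1   # value moved: not stuck
        i=$((i + 1))
    done
    echo "$1: stripe_cache_active fixed at $first over $2 samples"
    return 0
}
```

For example, `stripe_cache_stuck md0 10 1` takes ten samples one second apart and returns 0 only if every sample matched the first non-zero reading. Comparing the result with 'iostat -dx 1' on the member disks shows whether the disks are idle at the same time.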
