Re: [PATCH] md: raid10: wake up frozen array

Arthur Jones wrote:
> Hi Clive, ...
>
> On Sat, Aug 30, 2008 at 02:30:52PM -0700, Clive Messer wrote:
>> On Fri, 2008-07-25 at 12:03 -0700, Arthur Jones wrote:
>>> When rescheduling a bio in raid10, we wake up
>>> the md thread, but if the array is frozen, this
>>> will have no effect.  This causes the array to
>>> remain frozen for eternity.  We add a wake_up
>>> to allow the array to de-freeze.  This code is
>>> nearly identical to the raid1 code, which has
>>> this fix already.
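
For reference, the change in question is a single wake_up() call in
raid10's retry path.  The sketch below is reconstructed from the
description above and from the near-identical raid1 code, so the exact
identifiers and surrounding lines in drivers/md/raid10.c may differ a
bit:

/*
 * drivers/md/raid10.c (rough sketch): park a failed r10bio on the
 * retry list for raid10d.  The added wake_up() is the fix -- without
 * it, a frozen array sleeping on wait_barrier never re-checks its
 * wait condition after nr_queued changes.
 */
static void reschedule_retry(r10bio_t *r10_bio)
{
	unsigned long flags;
	mddev_t *mddev = r10_bio->mddev;
	conf_t *conf = mddev_to_conf(mddev);

	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r10_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);	/* the one-line fix */

	md_wakeup_thread(mddev->thread);
}
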
>> Can someone explain this to me in simple terms?

> The RAID sub-system needs to be able to synchronize
> certain operations; to do this, it "freezes" the
> array, i.e. no I/O will complete until it is un-frozen.
> This bug hit when we failed an I/O while the array
> was frozen.  In this case, we would never tell the
> frozen array that it was time to wake up and get back
> to work, and the retry would not make progress.
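
The waiting side of that handshake looks roughly like the sketch below
(pieced together from memory of the raid1/raid10 code of that era, so
the macro and field names may not match exactly).  The freezer sleeps
on conf->wait_barrier until every in-flight bio has either completed
or been queued for retry, which is why whoever bumps nr_queued must
also issue the wake_up:

/*
 * drivers/md/raid10.c (simplified sketch): "freezing" waits until all
 * pending I/O is accounted for, i.e. completed or parked on the retry
 * list.  The "+ 1" is the caller's own reference.
 */
static void freeze_array(conf_t *conf)
{
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;	/* keep new I/O from starting */
	conf->nr_waiting++;
	/* Only a wake_up(&conf->wait_barrier) makes this re-check the
	 * condition; the retry path was not issuing one. */
	wait_event_lock_irq(conf->wait_barrier,
			    conf->nr_pending == conf->nr_queued + 1,
			    conf->resync_lock,
			    raid10_unplug(conf->mddev->queue));
	spin_unlock_irq(&conf->resync_lock);
}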

>> What will cause a rescheduling of a bio?

> If the first bio read attempt failed (e.g. broken
> disk -- or, in my case, using fault injection),
> then raid10 will retry the block I/O.
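
The retry is kicked off from the read completion handler.  Very
roughly (abridged, and again the exact identifiers from that era's
drivers/md/raid10.c may differ):

/*
 * drivers/md/raid10.c (abridged sketch): a failed read is not
 * completed back to the caller; it is handed to the raid10d thread
 * via reschedule_retry() so it can be re-issued to another mirror.
 */
static void raid10_end_read_request(struct bio *bio, int error)
{
	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
	r10bio_t *r10_bio = bio->bi_private;

	if (uptodate) {
		set_bit(R10BIO_Uptodate, &r10_bio->state);
		raid_end_bio_io(r10_bio);	/* success: complete the I/O */
	} else {
		/* read error: park the bio for retry -- the path that
		 * could leave a frozen array stuck before this fix */
		reschedule_retry(r10_bio);
	}
}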

>> Frozen for eternity - what will be the effect assuming my root file
>> system is on raid10?

> The failed I/O will not complete, and the process which
> started the I/O will be stuck in an unkillable state
> forever.  Future I/O to the device would be put on
> hold (I guess; I never looked at this directly).

>> I have a Fedora Core 9 box using a 4 disk f2 raid10 array. This is
>> the main partition and root file system. Every couple of days the
>> machine would hard lock. Sometimes I could ssh in. Most of the time
>> not. I never managed to catch anything in the logs with SysRq. With
>> the benefit of hindsight, if the kernel was 'jammed' writing to
>> logfiles on a frozen raid10 array, that could explain it. I assumed
>> faulty hardware. I have actually replaced, one at a time (and at
>> considerable expense), the power supply, motherboard, processor, and
>> all 4 disks in the array. Still the machine would lock up. What is
>> interesting is that I have managed 5 days of uptime since I added
>> this one-line patch to 2.6.25.14-108.fc9.x86_64. Could someone
>> confirm for me that it is more than likely that the hard locks I
>> experienced on this machine could be resolved by this one-line
>> patch? Has this patch now made it into an official kernel release?

> It could be, but since you changed the drives
> and controller, it doesn't seem too likely.  You
> need some sort of failure to trigger this bug.
> Also, SysRq still worked fine for me when I
> triggered this bug...
>
> This patch is now in Linus' git tree, but it
> looks like it missed 2.6.26, so it won't be in
> an "official" release until 2.6.27...

I would hope that you or Neil would get it into the -stable series ASAP. While rare, this bug is a killer when it strikes.

--
Bill Davidsen <davidsen@xxxxxxx>
 "Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck

