Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]

Mikael Pettersson <mikpe@xxxxxxxx> · Mon, 18 Jun 2007 13:29:00 +0200 (MEST)

On Mon, 18 Jun 2007 16:09:49 +0900, Tejun Heo wrote:
> Mikael Pettersson wrote:
> > On Sat, 16 Jun 2007 15:52:33 +0400, Brad Campbell wrote:
> >> I've got a box here based on current Debian Stable.
> >> It's got 15 Maxtor SATA drives in it on 4 Promise TX4 controllers.
> >>
> >> Using kernel 2.6.21.x it shuts down, but of course with a huge "clack" as 15 drives all do emergency 
> >> head parks simultaneously. I thought I'd upgrade to 2.6.22-rc to get around this but the machine 
> >> just hangs up hard apparently trying to sync cache on a drive.
> >>
> >> I've run this process manually, so I know it is being performed properly.
> >>
> >> Prior to shutdown, all nfsd processes are stopped, filesystems unmounted and md arrays stopped.
> >> /proc/mdstat shows
> >> root@storage1:~# cat /proc/mdstat
> >> Personalities : [raid6] [raid5] [raid4]
> >> unused devices: <none>
> >> root@storage1:~#
> >>
> >> Here is the final hangup.
> >>
> >> http://www.fnarfbargle.com/CIMG1029.JPG
> > 
> > Something sent a command to the disk on ata15 after the PHY had been
> > offlined and the interface had been put in SLUMBER state (SStatus 614).
> > Consequently the command timed out. Libata tried a soft reset, and then
> > a hard reset, after which the machine hung.
> 
> Hmm... weird.  Maybe device initiated power saving (DIPS) is active?
> 
> > I don't think sata_promise is the guilty party here. Looks like some
> > layer above sata_promise got confused about the state of the interface.
> 
> But locking up hard after hardreset is a problem of sata_promise, no?

Maybe, maybe not. The original report doesn't specify where/how
the machine hung.

Brad: can you enable sysrq and check if the kernel responds to
sysrq when it appears to hang, and if so, where it's executing?

sata_promise just passes sata_std_hardreset to ata_do_eh.
I've certainly seen EH hardresets work before, so I'm assuming
that something in this particular situation (PHY offlined,
kernel close to shutting down) breaks things.
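
For reference, the "SStatus 614" value mentioned earlier can be decoded
by hand. The following is a minimal sketch, assuming the value is printed
in hex and follows the standard SATA SStatus layout (DET in bits 3:0,
SPD in bits 7:4, IPM in bits 11:8). The helper names are made up for
illustration and are not libata functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers: extract the three SStatus fields. */
static unsigned sstatus_det(uint32_t s) { return s & 0xf; }        /* bits 3:0  */
static unsigned sstatus_spd(uint32_t s) { return (s >> 4) & 0xf; } /* bits 7:4  */
static unsigned sstatus_ipm(uint32_t s) { return (s >> 8) & 0xf; } /* bits 11:8 */

/* Device detection (DET) field values. */
static const char *det_name(unsigned det)
{
    switch (det) {
    case 0: return "no device, no PHY communication";
    case 1: return "device present, PHY communication not established";
    case 3: return "device present, PHY communication established";
    case 4: return "PHY offline";
    default: return "reserved";
    }
}

/* Interface power management (IPM) field values. */
static const char *ipm_name(unsigned ipm)
{
    switch (ipm) {
    case 0: return "device not present or no communication";
    case 1: return "active";
    case 2: return "partial";
    case 6: return "slumber";
    default: return "reserved";
    }
}

/* Print a one-line decode of an SStatus value. */
static void sstatus_print(uint32_t s)
{
    printf("SStatus %X: DET=%u (%s), SPD=%u, IPM=%u (%s)\n",
           (unsigned)s,
           sstatus_det(s), det_name(sstatus_det(s)),
           sstatus_spd(s),
           sstatus_ipm(s), ipm_name(sstatus_ipm(s)));
}
```

Under those assumptions, sstatus_print(0x614) reports DET=4 (PHY offline),
SPD=1 and IPM=6 (slumber), which matches the "PHY offlined, interface in
SLUMBER" reading of the screenshot.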

FWIW, I'm seeing scsi layer accesses (cache flushes) after things
like rmmod sata_promise. They error out and don't seem to cause
any harm, but the fact that they occur at all makes me nervous.

/Mikael