Tejun Heo wrote:
Alan Cox wrote:
13 10:12:20 kern: res ff/ff:ff:ff:ff:ff/ff:ff:ff:ff:ff/ff Emask 0x12
(ATA bus error)
rn: ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }
13 10:14:37 kern: ata4: port is slow to respond, please be patient
(Status 0xff)
First guess would be a dud drive but it could be power or cabling or
firmware or ...
Hmm... this could be either the drive or the controller.
----
Just to confirm -- this particular problem was due to a faulty
brand-new SATA Western_Digital drive that died. It hung the system
several times under load, but shortly after the above errors,
the system would not boot with that drive attached.
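
For anyone grepping their own logs for the same symptoms, here is a minimal
sketch of pulling the flag names out of an SError line like the one quoted
above. The helper name and the regex are my own illustration (not part of
libata or any tool), and it assumes the "ataN: SError: { ... }" layout seen
in that message:

```python
import re

# Hypothetical helper, not part of any kernel tool. The pattern assumes
# the "ataN: SError: { FLAG FLAG ... }" layout from the quoted log line.
SERROR_RE = re.compile(r"(ata\d+): SError: \{([^}]*)\}")


def parse_serror(line):
    """Return (port, [flag names]) for an SError log line, or None."""
    match = SERROR_RE.search(line)
    if match is None:
        return None
    port = match.group(1)
    flags = match.group(2).split()  # flags are whitespace-separated
    return port, flags


sample = "ata4: SError: { RecovComm PHYRdyChg 10B8B Dispar DevExch }"
print(parse_serror(sample))
```

Counting how often each flag shows up per port over time might help tell a
one-off glitch from a drive that is on its way out, though that is only a
guess at a useful workflow.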
Secondary error: My ACPI implementation is, /apparently/, flaky.
I used to not be able to use ACPI back in the 2.2 timeframe. But
sometime in the 2.4 timeframe, ACPI started working with this system
(a 440BX based motherboard). I thought ACPI support had improved.
Symptom of an ACPI-based boot vs. non-ACPI: random hang (after a few hours,
up to maybe 48 hours max). But after I thought ACPI was 'fixed', booting
with ACPI (or not) resulted in a stable system.
But -- two different error types. Starting with the 2.6.25 series,
I started observing hangs again (same in the 2.6.26 series). My last
stable was 2.6.24.1. BUT -- I also occasionally noticed some rare
sporadic disk error messages (while looking for the cause of the hang) --
they weren't there in the "pre-hang" 2.6.24.1 kernel...(I couldn't
even get a 2.6.24.7 kernel to stay up for more than 2 days).
My upgrade strategy for disks has been to move to SATA disks as
I needed to replace older PATAs. Had a lot of problems last Feb when
I tried to use SATA; after a few weeks of making no progress discovering
the source of the hangs, I went back to a PATA drive and took out the SATA
controller -- and the system went back to stable. Ok...I'm tired of
debugging this...let's stay with PATA for now.
Six months later...need another disk. Back to trying SATA...
more hangs (and a bad disk drive). It seems that in addition to
ACPI no longer working above my 2.6.24.1 kernel, adding in the SATA
board also would cause an ACPI based boot to eventually hang (max
runtime ~30 hours). Using the kernel boot option "acpi=noirq" seems to be
the key to stability now.
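
As a quick sanity check that the option actually took effect after a reboot,
something like the following sketch could be used. It assumes the running
kernel's command line is readable from /proc/cmdline; the function names are
just illustrative:

```python
def has_boot_param(cmdline, param):
    """True if `param` appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    try:
        booted_with = running_cmdline()
    except OSError:
        # /proc may be unavailable (non-Linux system, restricted container).
        booted_with = ""
    print("acpi=noirq active:", has_boot_param(booted_with, "acpi=noirq"))
```

Matching on whole words avoids a false positive from some longer option that
merely contains the same text.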
So I don't know exactly what changed -- but ACPI, which was working
(pre-SATA) seemed to stop being reliable after 2.6.24.1.
Any way I cut it, acpi=noirq now seems to be a requirement for
system stability. My ACPI version string shows it as "1.0"...so I'm
guessing there might have been some kinks in the implementation.
So I had 4 different problems all converge at roughly the same time:
1) new SATA Western_Digital-1TB disk failure,
2) ACPI-induced instability in 2.6.25 and above,
3) ACPI-induced instability with addition of new SATA controller
(including a rebuilt-for-sata-support 2.6.24.1),
4) Auxiliary cooling fan failed and system would get 'warm' (don't know
exact temps, but some disks were nearing 50C; normal is mid 30's,
except for the 15K system SCSI, which has its own attached fan, so
it's usually a few degrees cooler when the case-fans are operating
correctly).
However, the disk temps are not indicative of the CPU temps -- they
are only an indirect sign that case-airflow is sub-optimal. The
CPUs (2 1GHz P-IIIs) in this baby don't give reliable thermal
warnings (have only ever seen 1). Usually the system will
just 'hang' (not the most helpful indicator in any event).
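
Since the disk temps are the only early airflow warning available here, a
rough sketch of flagging warm disks follows. The 45C limit is an assumed
midpoint between the "mid 30's" normal and the "nearing 50C" readings
mentioned above, and the device names and readings are made up for
illustration; how the temperatures get collected is left out entirely:

```python
def warm_disks(temps_c, limit_c=45):
    """Return the sorted names of disks at or above limit_c (degrees C).

    limit_c=45 is an assumed threshold, not a vendor specification.
    """
    return sorted(name for name, temp in temps_c.items() if temp >= limit_c)


# Hypothetical readings, for illustration only.
readings = {"sda": 36, "sdb": 49, "sdc": 34}
print(warm_disks(readings))
```

A disk-based check like this only hints at poor case airflow; it says
nothing direct about the CPU temps.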
Thanks much for feedback that led me to figuring out (*crossing
fingers*) the problems and fixes...
Linda Walsh