Hello. This is a follow-up to a discussion on the linux-scsi list.
Quick summary:
I have a SAN with 2 RAIDs (IFT 7250F) and a farm of servers attached to
it. All are the same hardware (dual Xeon 2.8 GHz, QLA2310F, 2 GB RAM) and
run the very same kernel (2.4.26 + dm 0.17 + qla 6.06.64).
The data are on LVM volumes managed with EVMS. Everything had been running
fine for months, with gigabytes and gigabytes of data moved every day.
But at the beginning of the week I began to get errors like these:
SCSI disk error : host 3 channel 0 id 2 lun 1 return code = 20000
to the point where the partition was totally unavailable.
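For reference, that return code is the SCSI mid-layer result word, printed
in hex. If I decode it with the byte layout and DID_* values from the 2.4
drivers/scsi/scsi.h, the host byte comes out as DID_BUS_BUSY. A minimal
sketch of the decode (this snippet is mine, not from the original thread):

/* Decode a 2.4 SCSI mid-layer result word such as 0x20000.
 * Layout (drivers/scsi/scsi.h):
 *   result = (driver_byte << 24) | (host_byte << 16)
 *          | (msg_byte << 8) | status_byte
 */
#include <stdio.h>

#define DID_OK         0x00   /* no error        */
#define DID_NO_CONNECT 0x01   /* couldn't connect */
#define DID_BUS_BUSY   0x02   /* bus stayed busy */
#define DID_TIME_OUT   0x03   /* timed out       */

int main(void)
{
    unsigned int result = 0x20000;           /* "return code = 20000" */
    unsigned int host = (result >> 16) & 0xff;

    printf("host byte = 0x%02x%s\n", host,
           host == DID_BUS_BUSY ? " (DID_BUS_BUSY)" : "");
    return 0;
}

DID_BUS_BUSY would at least be consistent with the RAID and the fabric
being the first suspects.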
After much searching and testing (all elements were suspected: the RAID
itself, its firmware, the QLogic driver), I finally managed to get a
working system again:
2.4.26 + dm 0.17 + qla 6.06.64 = SCSI errors
2.4.26 + dm 0.17 + qla 7.00.03 = hang (no errors, but all scsi/qla
operations just hang), then (after a long time) SCSI errors
-> upgraded the SAN RAID to the very latest firmware;
same kernels as above: same errors
2.6.7 (with embedded dm & qla 8.xx.x) = SCSI errors too (I tried that
because I know a lot of work has been done on the scsi, qla & dm layers)
2.4.27-rc3 + dm 0.19 + qla 7.00.03 = no errors
2.4.27-rc3 + dm 0.19 + qla 6.06.64 = no errors
As my problem has only arisen on LVM volumes that have been resized by
EVMS, and the only difference between the failing and non-failing setups
is the device-mapper version, I am beginning to wonder if the culprit is
there.
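To narrow it down on my side, I can dump the mapping table of a resized
volume under both dm versions and diff the segments. A minimal sketch
using libdevmapper, equivalent to running dmsetup table <name> (the device
name is whatever EVMS created; error handling kept short; again, this
snippet is mine):

/* dmtable.c - print the device-mapper table for one device,
 * like "dmsetup table <name>". Build with -ldevmapper.
 */
#include <stdio.h>
#include <stdint.h>
#include <libdevmapper.h>

int main(int argc, char **argv)
{
    struct dm_task *dmt;
    void *next = NULL;
    uint64_t start, length;
    char *target_type, *params;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <dm-device-name>\n", argv[0]);
        return 1;
    }

    if (!(dmt = dm_task_create(DM_DEVICE_TABLE)))
        return 1;

    if (!dm_task_set_name(dmt, argv[1]) || !dm_task_run(dmt)) {
        dm_task_destroy(dmt);
        return 1;
    }

    /* walk the targets: start length type params ("linear ...", etc.) */
    do {
        next = dm_get_next_target(dmt, next, &start, &length,
                                  &target_type, &params);
        if (target_type)
            printf("%llu %llu %s %s\n",
                   (unsigned long long)start,
                   (unsigned long long)length,
                   target_type, params);
    } while (next);

    dm_task_destroy(dmt);
    return 0;
}

If the segment boundaries differ between the working and the failing
setups, that would point at the resize path rather than the I/O path.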
Is it possible that something changed between dm 1.0.17 and 1.0.19 that
could explain this behaviour?
Or between 2.4.26 and 2.4.27-rc3 (in which case I have to change mailing
lists again ;-)
Anyway, I'm very pleased my problem is solved, but I'd like to find a
definitive explanation...
--
Yann Dupont, Cri de l'université de Nantes
Tel: 02.51.12.53.91 - Fax: 02.51.12.58.60 - Yann.Dupont@xxxxxxxxxxxxxx