Hi,
Please check the following:
1) Run e2fsck -c /dev/sda1 to scan the file system for bad blocks.
2) Check the hard disk with the vendor-supplied hardware/SAN health monitoring utility.
3) Check whether there is any network flood, such as a DoS attack, on your network.
Regards,
karthikeyan.N
Neil Watson wrote:
I'm building a cluster that runs a DB2 service. The cluster has 2 nodes
in an active/standby configuration. I am now performing failover
tests.
Shared resources:
DB2 controlled by /etc/init.d/db2 start stop script.
Floating IP address.
/db2 ext3 file system located on a SAN and connected via HBA.
Nodes are fenced with ILO cards.
Nodes are running AS4 x86_64 with the Red Hat Cluster Suite. RPMs are up
to date.
Procedure:
1. Connect to DB2 remotely and begin a long SQL insert program.
2. While the inserts are being performed, disconnect the fibre cable
from the HBA on the active node.
3. Examine the system logs and watch for failover.
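Step 3 can be partly automated. The following is a minimal sketch, assuming syslog lines of the usual "Mon DD HH:MM:SS host tag: message" form. The pattern list and the names classify_line and summarize are illustrative only and are not part of any cluster tool.

```python
import re

# Hypothetical patterns for events worth watching during a failover test.
# The first matching pattern wins, so order matters.
PATTERNS = [
    ("link_down", re.compile(r"LOOP DOWN detected")),
    ("io_error", re.compile(r"I/O error")),
    ("journal_abort", re.compile(r"Aborting journal")),
    ("remount_ro", re.compile(r"Remounting filesystem read-only")),
    ("service_stop", re.compile(r"Stopping service \S+")),
    ("status_fail", re.compile(r'status on \S+ "[^"]+" returned \d+')),
]


def classify_line(line):
    """Return the label of the first pattern found in the line, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall under each event label."""
    counts = {}
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Feeding the contents of /var/log/messages from both nodes through summarize gives a quick view of whether the standby node logged anything at all during the test.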
Observations:
1. Cluster does not fail over to standby node. Service becomes
unavailable.
2. The log files of the active node report a 'generic error' about the
status of the shared file system.
Aug 16 15:32:37 caesar kernel: qla2300 0000:06:01.0: LOOP DOWN detected (2).
Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15839
Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15847
Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15855
Aug 16 15:32:45 caesar kernel: Buffer I/O error on device sda1, logical block 1974
Aug 16 15:32:45 caesar kernel: lost page write due to I/O error on sda1
Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 103813199
Aug 16 15:32:45 caesar kernel: Buffer I/O error on device sda1, logical block 12976642
Aug 16 15:32:45 caesar kernel: lost page write due to I/O error on sda1
Aug 16 15:32:45 caesar kernel: Aborting journal on device sda1.
Aug 16 15:32:45 caesar kernel: ext3_abort called.
Aug 16 15:32:45 caesar kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Aug 16 15:32:45 caesar kernel: Remounting filesystem read-only
Aug 16 15:32:47 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
Aug 16 15:32:47 caesar kernel: end_request: I/O error, dev sda, sector 8279
Aug 16 15:32:47 caesar kernel: Buffer I/O error on device sda1, logical block 1027
Aug 16 15:32:47 caesar kernel: lost page write due to I/O error on sda1
Aug 16 15:32:47 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
Aug 16 15:32:47 caesar kernel: end_request: I/O error, dev sda, sector 103546959
Aug 16 15:32:47 caesar kernel: Buffer I/O error on device sda1, logical block 12943362
Aug 16 15:32:47 caesar kernel: lost page write due to I/O error on sda1
Aug 16 15:32:48 caesar clurgmgrd[5159]: <notice> status on fs "db2" returned 1 (generic error)
Aug 16 15:32:48 caesar clurgmgrd[5159]: <notice> Stopping service db2
Aug 16 15:32:48 caesar clurgmgrd: [5159]: <info> Executing /etc/rc.d/init.d/db2 stop
Aug 16 15:32:48 caesar su(pam_unix)[1227]: session opened for user dwapinst by (uid=0)
Aug 16 15:32:49 caesar su:
Aug 16 15:32:49 caesar su: Instance : dwapinst
Aug 16 15:32:49 caesar su: DB2 State : Available
Aug 16 15:32:49 caesar su(pam_unix)[1227]: session closed for user dwapinst
Aug 16 15:32:49 caesar db2: succeeded
Aug 16 15:32:49 caesar su(pam_unix)[1322]: session opened for user dwapinst by (uid=0)
Aug 16 15:36:51 caesar su(pam_unix)[5473]: session opened for user root by nhwatson(uid=0)
Aug 16 15:36:55 caesar su(pam_unix)[6000]: session opened for user dwapinst by nhwatson(uid=0)
Aug 16 15:36:55 caesar su:
Aug 16 15:36:55 caesar su: Instance : dwapinst
Aug 16 15:36:55 caesar su: DB2 State : Operable
Aug 16 15:36:55 caesar su(pam_unix)[6000]: session closed for user dwapinst
Aug 16 15:36:55 caesar db2: failed
3. There are no log entries for this event on the standby node.
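The sector numbers and logical block numbers in the kernel messages above appear to be related by simple arithmetic. The sketch below assumes 512-byte sectors, a 4 KiB ext3 block size, and that sda1 starts at sector 63 (the traditional DOS partition offset). None of these values is stated in the logs; they are guesses that happen to fit every reported pair.

```python
SECTOR_SIZE = 512       # bytes per sector (assumed)
BLOCK_SIZE = 4096       # bytes per file system block (assumed)
PARTITION_START = 63    # first sector of sda1 (assumed, not shown in the logs)


def sector_to_block(sector):
    """Map an absolute sector on sda to a block number within sda1."""
    sectors_per_block = BLOCK_SIZE // SECTOR_SIZE
    return (sector - PARTITION_START) // sectors_per_block


# (sector on sda, logical block on sda1) pairs copied from the log excerpt.
pairs = [
    (15855, 1974),
    (103813199, 12976642),
    (8279, 1027),
    (103546959, 12943362),
]

for sector, reported_block in pairs:
    print(sector, sector_to_block(sector), reported_block)
```

If the assumptions hold, each "Buffer I/O error" line refers to the same write as the "end_request" line just before it, so the two message types describe one failure from the SCSI layer and the buffer layer respectively.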
Why does the cluster fail during this test? What does the 'generic error'
mean?
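For reference on the second question: rgmanager resource agents follow the OCF resource agent convention for exit codes, in which 1 is the catch-all "generic error". The lookup below is a small sketch of that convention as I understand it; the name describe_status is illustrative, and the assumption that clurgmgrd prints exactly these strings should be checked against the rgmanager source.

```python
# OCF-style resource agent exit codes.
# Assumption: these are the codes that clurgmgrd translates in its log output.
OCF_EXIT_CODES = {
    0: "success",
    1: "generic error",
    2: "invalid argument(s)",
    3: "function not implemented",
    4: "insufficient privileges",
    5: "program not installed",
    6: "program not configured",
    7: "not running",
}


def describe_status(code):
    """Return the conventional meaning of a resource agent exit code."""
    return OCF_EXIT_CODES.get(code, "unknown exit code")
```

Under this reading, "returned 1" says only that the status check on the fs resource failed without a more specific reason. The kernel messages logged just before it (aborted journal, read-only remount) are the more likely place to find the cause.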
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster