Re: Storage/SCSI Error on our CentOS server

"Hairul Ikmal Mohamad Fuzi" <hairul.ikmal@xxxxxxxxx> · Fri, 2 Mar 2007 12:22:29 +0800

Just another follow-up: after we swapped the SCSI host adapter, the
storage seems to be working fine (no more read/write or access/mount
errors).

But once in a while (e.g. once or twice in 2 days), even though it is
working fine, we still get some error messages, which we guess are
similar to the messages in the previous posts.

We do not know how critical this is, but maybe you guys could give
some valuable advice or input (e.g. whether we should also change the
cables, etc.).
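
To put a number on "once or twice in 2 days", a rough sketch like the
one below could tally the relevant messages in a saved kernel log. The
message fragments are copied from the excerpt further down; the
function name and the event labels are made up for illustration, and
other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Message fragments copied from the dmesg excerpt in this post.
# The keys are arbitrary labels chosen for this sketch.
PATTERNS = {
    "abort": re.compile(r"Attempting to abort cmd"),
    "transmission_error": re.compile(r"Transmission error detected"),
    "card_state_dump": re.compile(r"Dump Card State Begins"),
}


def count_scsi_events(log_text):
    """Count how many log lines match each of the patterns above."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing it the saved output of 'dmesg' (read into a string) returns a
count per label, which could be compared before and after a cable swap.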

TIA.
Cheers!

-Ikmal

Latest 'dmesg' excerpt, which appeared once in the last 2 days:

scsi5:0:0:0: Attempting to abort cmd 0000010002fa5380: 0x28 0x0 0x0 0x0 0x1 0x3f 0x0 0x0 0x8 0x0
scsi5: At time of recovery, card was not paused
Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi5: Dumping Card State at program address 0x5 Mode 0x33
Card was paused
HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
SEQINTCTL[0x0] SEQ_FLAGS[0xc0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
SSTAT1[0x8] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0xc0]
SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0x0]

SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x2 NEXTSCB 0x0
qinstart = 59 qinfifonext = 59
QINFIFO:
WAITING_TID_QUEUES:
Pending list:
 2 FIFO_USE[0x0] SCB_CONTROL[0x64] SCB_SCSIID[0x7]
Total 1
Kernel Free SCB list: 3 1 0
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:

scsi5: FIFO0 Free, LONGJMP == 0x80ff, SCB 0x0
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
scsi5: FIFO1 Free, LONGJMP == 0x81d8, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
LQIN: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
scsi5: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi5: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x0
SIMODE0[0xc]
CCSCBCTL[0x0]
scsi5: REG0 == 0xffff, SINDEX = 0x1e0, DINDEX = 0xe1
scsi5: SCBPTR == 0x3, SCB_NEXT == 0x2, SCB_NEXT2 == 0x2
CDB 28 0 0 80 19 7c
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
DevQ(0:0:0): 0 waiting
(scsi5:A:0:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
Recovery SCB completes
Recovery code awake
scsi5: Transmission error detected
LQISTAT1[0x0] LASTPHASE[0x1] SCSISIGI[0x0] PERRDIAG[0x1]
Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi5: Dumping Card State at program address 0x26 Mode 0x11
Card was paused
HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
DFFSTAT[0x33] SCSISIGI[0x1a] SCSIPHASE[0x1] SCSIBUS[0xff]
LASTPHASE[0x1] SCSISEQ0[0x40] SCSISEQ1[0x12] SEQCTL0[0x0]
SEQINTCTL[0x0] SEQ_FLAGS[0xc0] SEQ_FLAGS2[0x0] SSTAT0[0x10]
SSTAT1[0x11] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
SIMODE1[0xac] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0x0]

SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x2 NEXTSCB 0x0
qinstart = 61 qinfifonext = 61
QINFIFO:
WAITING_TID_QUEUES:
      0 ( 0x2 )
Pending list:
 2 FIFO_USE[0x0] SCB_CONTROL[0x50] SCB_SCSIID[0x7]
Total 1
Kernel Free SCB list: 3 1 0
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:

scsi5: FIFO0 Free, LONGJMP == 0x80ff, SCB 0x0
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x1] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
scsi5: FIFO1 Free, LONGJMP == 0x81d8, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0]
SOFFCNT[0x1] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
LQIN: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
scsi5: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi5: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x0
SIMODE0[0xc]
CCSCBCTL[0x4]
scsi5: REG0 == 0x3, SINDEX = 0x11d, DINDEX = 0xe1
scsi5: SCBPTR == 0x3, SCB_NEXT == 0x2, SCB_NEXT2 == 0x2
CDB 28 0 0 80 19 7c
STACK: 0x13 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
DevQ(0:0:0): 0 waiting
(scsi5:A:0): 80.000MB/s transfers (40.000MHz DT, 16bit)
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdg1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sde1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
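
The aborted command at the top of the excerpt is printed as ten raw
CDB bytes. Assuming the standard SCSI READ(10) layout (opcode 0x28, a
big-endian 4-byte logical block address in bytes 2-5, and a 2-byte
transfer length in bytes 7-8), a small sketch like this decodes it.
The function name is hypothetical and used only for illustration.

```python
def decode_read10(cdb):
    """Decode a 10-byte READ(10) CDB into its LBA and block count."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2-5: logical block address, most significant byte first.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7-8: number of blocks to transfer, most significant byte first.
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return {"opcode": "READ(10)", "lba": lba, "blocks": blocks}


# Bytes copied from the "Attempting to abort cmd" line above.
aborted = [0x28, 0x0, 0x0, 0x0, 0x1, 0x3F, 0x0, 0x0, 0x8, 0x0]
print(decode_read10(aborted))
# prints {'opcode': 'READ(10)', 'lba': 319, 'blocks': 8}
```

So the command that timed out was an ordinary 8-block read near the
start of the device. That by itself does not say whether the adapter,
the cable or the array is at fault.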

On 2/26/07, Hairul Ikmal Mohamad Fuzi <hairul.ikmal@xxxxxxxxx> wrote:
John, Vasiliy,

Thanks for the input.
We managed to figure out the problem after swapping all the items.
It seems the SCSI host adapter is giving us the problem.
Cheers.

-Ikmal

On 2/25/07, Vasiliy Boulytchev <vasiliy@xxxxxxxxxxxxxxxx> wrote:
> Agreed, I was thinking of cables as well.
>
> See if you get better performance when you replace the cables :)
>
> Good luck
>
> John R Pierce wrote:
> >
> > Hairul Ikmal Mohamad Fuzi wrote:
> >> Hi,
> >>
> >> Currently we are running CentOS 4.x on a 2-way Opteron machine.
> >> This machine, through a SCSI host adapter (Adaptec), is connected to a
> >> 2TB storage unit (an external RAID-5 disk array)
> >>
> >> Until our recent unintentional power trip, everything was fine and
> >> smooth.
> >> We have been experiencing complications accessing the storage (it
> >> could be an intermittent filesystem error, a partition that could not
> >> be mounted in read-write mode, unacceptable writing speed, etc.),
> >> especially when we start to 'write' to the storage.
> >>
> >> After a few checks, we suspect either:
> >>
> >> 1) the storage unit (but the storage control panel did not report any
> >> disk/raidset failure) is failing or,
> >> 2) the SCSI host adapter is failing, or
> >> 3) the filesystem itself is corrupted (we did 'fsck.ext3 -v -f' but it
> >> turned out it did not find any errors)
> >
> >
> > or 4) scsi cabling.   I see some scsi transmission errors in there.
> > About the only way I know to diagnose something like this would be to
> > swap parts... I'd swap the controller card and see if the problems go
> > away, then try the cable, then try the storage controller.   If one of
> > these things fixes the problem, back the other changes out (i.e. put the
> > original card back, etc.).
> > _______________________________________________
> > CentOS mailing list
> > CentOS@xxxxxxxxxx
> > http://lists.centos.org/mailman/listinfo/centos
