Random hangs with IBM xSeries 330

Hello,

We have an IBM xSeries 330 server running Red Hat 7.2. The server works fine 
most of the time, but every now and then it just hangs. At first I thought it 
had something to do with SMP, but now it looks like a RAID-related problem. 
If I'm lucky and already logged in, I can still use the console after the 
hang. Whenever this happens, all hard disk activity seems to have stopped: 
I can usually run commands that are already loaded in memory, but anything 
that needs to touch the disk fails.

Physically, the server keeps blinking its disk lights as usual even after 
the hang.
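
One way to pin down the moment disk I/O stops is a small probe that is 
started before the hang and keeps trying to write. The following is only a 
sketch, not something that has been run on this server: the function name 
`probe_disk`, the `ioprobe-` file prefix and the 10-second default timeout 
are made up for illustration. It writes a tiny file, calls fsync, and reports 
whether that finished in time.

```python
import os
import tempfile
import threading
import time


def probe_disk(directory, timeout=10.0):
    """Return True if a small write plus fsync in `directory` finishes
    within `timeout` seconds, False otherwise (including on errors)."""
    done = threading.Event()

    def write_and_sync():
        fd, path = tempfile.mkstemp(dir=directory, prefix="ioprobe-")
        try:
            os.write(fd, str(time.time()).encode())
            os.fsync(fd)  # push the data to the device, not just the page cache
        finally:
            os.close(fd)
            os.unlink(path)
        done.set()

    # A daemon thread lets the caller carry on even if the write never returns.
    threading.Thread(target=write_and_sync, daemon=True).start()
    return done.wait(timeout)
```

Called in a loop from a process that is already running, with the result 
reported to the console or over the network rather than to a log file on the 
same disk, a False result would mark roughly when the disk stopped responding.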

With Red Hat's own kernel (2.4.9-13) the server crashes after 1-12 hours of 
uptime. With our home-cooked 2.4.17 we have reached as much as 40 days of 
uptime, but after that it hangs again.

Is this a known issue? Is there a known bug lurking somewhere in the 
SCSI, RAID or other disk-related code?

Some information about the server: 

- Red Hat 7.2
- Linux kernel 2.4.17
- IBM ServeRAID (version 4.80.26)
- 1 GB RAM
- 2 x Pentium III Xeon

And a hell of a lot more information:

lspci output:
---
00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (rev 21)
00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
00:00.2 Host bridge: ServerWorks: Unknown device 0006
00:00.3 Host bridge: ServerWorks: Unknown device 0006
00:05.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet LANCE] (rev 44)
00:06.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 4f)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
02:01.0 SCSI storage controller: Adaptec 7899P (rev 01)
02:01.1 SCSI storage controller: Adaptec 7899P (rev 01)
05:02.0 RAID bus controller: IBM Netfinity ServeRAID controller
---

dmesg output:
---
Linux version 2.4.17 (root@mbnet.mbnet.fi) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #3 SMP Mon Jan 21 18:55:55 EET 2002
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009d000 (usable)
 BIOS-e820: 000000000009d000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff9380 (usable)
 BIOS-e820: 000000003fff9380 - 0000000040000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
found SMP MP-table at 0009e1d0
hm, page 0009e000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 0009e000 reserved twice.
hm, page 0009f000 reserved twice.
WARNING: MP table in the EBDA can be UNSAFE, contact linux-smp@vger.kernel.org if you experience SMP problems!
On node 0 totalpages: 262137
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32761 pages.
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: IBM ENSW Product ID: NF 6000R SMP APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 17
Processor #0 Pentium(tm) Pro APIC version 17
I/O APIC #14 Version 17 at 0xFEC00000.
I/O APIC #13 Version 17 at 0xFEC01000.
Processors: 2
Kernel command line: auto BOOT_IMAGE=linuxjaba2417 ro root=808 BOOT_FILE=/boot/vmlinuz-jaba-2417
Initializing CPU#0
Detected 701.803 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1399.19 BogoMIPS
Memory: 1029052k/1048548k available (1573k kernel code, 19100k reserved, 449k data, 220k init, 131044k highmem)
Dentry-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Mount-cache hash table entries: 16384 (order: 5, 131072 bytes)
Buffer-cache hash table entries: 65536 (order: 6, 262144 bytes)
Page-cache hash table entries: 262144 (order: 8, 1048576 bytes)
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch@atnf.csiro.au)
mtrr: detected mtrr type: Intel
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel Pentium III (Cascades) stepping 01
per-CPU timeslice cutoff: 2927.55 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Booting processor 1/0 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1402.47 BogoMIPS
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#1.
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
CPU1: Intel Pentium III (Cascades) stepping 01
Total of 2 processors activated (2801.66 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 14 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 14 ... ok.
Setting 13 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 13 ... ok.
init IO_APIC IRQs
 IO-APIC (apicid-pin) 14-0, 13-10, 13-11, 13-12, 13-13, 13-14, 13-15 not connected.
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...  failed.
...trying to set up timer as Virtual Wire IRQ... works.
number of MP IRQ sources: 39.
number of IO-APIC #14 registers: 16.
number of IO-APIC #13 registers: 16.
testing the IO APIC.......................

IO APIC #14......
.... register #00: 0E000000
.......    : physical APIC id: 0E
.... register #01: 000F0011
.......     : max redirection entries: 000F
.......     : PRQ implemented: 0
.......     : IO APIC version: 0011
.... register #02: 0E000000
.......     : arbitration: 0E
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
 00 000 00  1    0    0   0   0    0    0    00
 01 003 03  0    0    0   0   0    1    1    39
 02 000 00  1    0    0   0   0    0    0    00
 03 003 03  0    0    0   0   0    1    1    41
 04 003 03  0    0    0   0   0    1    1    49
 05 003 03  1    1    0   1   0    1    1    51
 06 003 03  0    0    0   0   0    1    1    59
 07 003 03  0    0    0   0   0    1    1    61
 08 003 03  0    0    0   0   0    1    1    69
 09 003 03  1    1    0   1   0    1    1    71
 0a 003 03  1    1    0   1   0    1    1    79
 0b 003 03  1    1    0   1   0    1    1    81
 0c 003 03  0    0    0   0   0    1    1    89
 0d 003 03  0    0    0   0   0    1    1    91
 0e 003 03  0    0    0   0   0    1    1    99
 0f 003 03  1    1    0   1   0    1    1    A1

IO APIC #13......
.... register #00: 0D000000
.......    : physical APIC id: 0D
.... register #01: 000F0011
.......     : max redirection entries: 000F
.......     : PRQ implemented: 0
.......     : IO APIC version: 0011
.... register #02: 0D000000
.......     : arbitration: 0D
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
 00 003 03  1    1    0   1   0    1    1    A9
 01 003 03  1    1    0   1   0    1    1    B1
 02 003 03  1    1    0   1   0    1    1    B9
 03 003 03  1    1    0   1   0    1    1    C1
 04 003 03  1    1    0   1   0    1    1    C9
 05 003 03  1    1    0   1   0    1    1    D1
 06 003 03  1    1    0   1   0    1    1    D9
 07 003 03  1    1    0   1   0    1    1    E1
 08 003 03  1    1    0   1   0    1    1    E9
 09 003 03  1    1    0   1   0    1    1    32
 0a 000 00  1    0    0   0   0    0    0    00
 0b 000 00  1    0    0   0   0    0    0    00
 0c 000 00  1    0    0   0   0    0    0    00
 0d 000 00  1    0    0   0   0    0    0    00
 0e 000 00  1    0    0   0   0    0    0    00
 0f 000 00  1    0    0   0   0    0    0    00
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 1:0
IRQ17 -> 1:1
IRQ18 -> 1:2
IRQ19 -> 1:3
IRQ20 -> 1:4
IRQ21 -> 1:5
IRQ22 -> 1:6
IRQ23 -> 1:7
IRQ24 -> 1:8
IRQ25 -> 1:9
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 701.7336 MHz.
..... host bus clock speed is 100.2475 MHz.
cpu: 0, clocks: 1002475, slice: 334158
CPU0<T0:1002464,T1:668304,D:2,S:334158,C:1002475>
cpu: 1, clocks: 1002475, slice: 334158
CPU1<T0:1002464,T1:334144,D:4,S:334158,C:1002475>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfd32c, last bus=8
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered peer bus 02
PCI: Discovered peer bus 05
PCI->APIC IRQ transform: (B0,I5,P0) -> 16
PCI->APIC IRQ transform: (B0,I15,P0) -> 19
PCI->APIC IRQ transform: (B2,I1,P0) -> 17
PCI->APIC IRQ transform: (B2,I1,P1) -> 18
PCI->APIC IRQ transform: (B5,I2,P0) -> 21
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
IBM machine detected. Enabling interrupts during APM calls.
Starting kswapd
allocated 32 pages and 32 bhs reserved for the highmem bounces
VFS: Diskquotas version dquot_6.4.0 initialized
Journalled Block Device driver loaded
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10e
block: 128 slots per queue, batch=32
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x0700-0x0707, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0x0708-0x070f, BIOS settings: hdc:pio, hdd:pio
hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 48X CD-ROM drive, 128kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
loop: loaded (max 8 devices)
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
        <Adaptec aic7899 Ultra160 SCSI adapter>
        aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
        <Adaptec aic7899 Ultra160 SCSI adapter>
        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

scsi2 : IBM PCI ServeRAID 4.80.26  <ServeRAID 4M>
  Vendor: IBM       Model: SERVERAID         Rev: 1.00
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: IBM       Model: SERVERAID         Rev: 1.00
  Type:   Processor                          ANSI SCSI revision: 02
  Vendor: IBM       Model: YGLv3 S2          Rev: 0
  Type:   Processor                          ANSI SCSI revision: 02
  Vendor: IBM       Model: YGHv3 S2          Rev: 0
  Type:   Processor                          ANSI SCSI revision: 02
Attached scsi disk sda at scsi2, channel 0, id 0, lun 0
SCSI device sda: 355481600 512-byte hdwr sectors (182007 MB)
Partition check:
 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 >
Attached scsi generic sg1 at scsi2, channel 0, id 15, lun 0,  type 3
Attached scsi generic sg2 at scsi2, channel 1, id 8, lun 0,  type 3
Attached scsi generic sg3 at scsi2, channel 1, id 9, lun 0,  type 3
Linux Kernel Card Services 3.1.22
  options:  [pci] [cardbus]
md: raid5 personality registered as nr 4
raid5: measuring checksumming speed
   8regs     :  1294.400 MB/sec
   32regs    :   631.200 MB/sec
   pIII_sse  :  1436.400 MB/sec
   pII_mmx   :  1578.000 MB/sec
   p5_mmx    :  1643.600 MB/sec
raid5: using function: pIII_sse (1436.400 MB/sec)
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
pci_hotplug: PCI Hot Plug PCI Core version: 0.3
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 8192 buckets, 64Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
ds: no socket drivers loaded!
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 220k freed
Adding Swap: 2096220k swap-space (priority -1)
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,6), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
pcnet32_probe_pci: found device 0x001022.0x002000
    ioaddr=0x002200  resource_flags=0x000101
eth0: PCnet/FAST III 79C975 at 0x2200, 00 02 55 fc 1a 6e
pcnet32: pcnet32_private lp=f6757000 lp_dma_addr=0x36757000 assigned IRQ 16.
pcnet32.c:v1.25kf 26.9.1999 tsbogend@alpha.franken.de
sending pkt_too_big to self
---

ipssend output:
----
Found 1 IBM ServeRAID controller(s).
Read configuration has been initiated for controller 1...
-------------------------------------------------------------------------------
Controller information
-------------------------------------------------------------------------------
   Controller type                : ServeRAID-4M
   BIOS version                   : 4.80.26
   Firmware version               : 4.80.26
   Boot block version             : 4.70.17
   Device driver version          : 4.80.26
   Controller slot information    : 2
   Controller Name                : Null Config
   SCSI channel description       : 2 parallel SCSI wide
   Initiator IDs (Channel/SCSI ID): 1/7 2/7
   Maximum physical devices       : 30
   Defunct disk drive count       : 0
   Logical drives/Offline/Critical: 1/0/0
   Read ahead                     : Adaptive
   Stripe-unit size               : 8 KB
   Rebuild rate (Low/Medium/High) : High
   Hot-swap rebuild               : Enabled
   Data scrubbing                 : Enabled
   Part of cluster (Yes/No)       : No
   Unattended mode (Yes/No)       : No
   Concurrent commands supported  : 96
   Configuration update count     : 25
-------------------------------------------------------------------------------
Logical drive information
-------------------------------------------------------------------------------
 Logical drive number 1
   Status of logical drive        : Okay (OKY)
   RAID level                     : 5
   Size (in MB)                   : 173575
   Write cache status             : Write back (WB)
   Number of chunks               : 6
   Stripe-unit size               : 8 KB
   Access blocked                 : No
   Part of array                  : A
   Part of merge group            : 207

   Array A stripe order (Channel/SCSI ID)  : 1,0 1,1 1,2 1,12 1,13 1,14
-------------------------------------------------------------------------------
Physical device information
-------------------------------------------------------------------------------
   Channel #1:
      Initiator at SCSI ID 7
      Target on SCSI ID 0
         Device is a Hard disk
         SCSI ID                  : 0
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY83763
         FRU part number          : 19K1469
      Target on SCSI ID 1
         Device is a Hard disk
         SCSI ID                  : 1
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8C929
         FRU part number          : 19K1469
      Target on SCSI ID 2
         Device is a Hard disk
         SCSI ID                  : 2
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY88233
         FRU part number          : 19K1469
      Target on SCSI ID 8
         Device is a Processor device
         SCSI ID                  : 8
         PFA (Yes/No)             : No
         State                    : Standby (SBY)
         Size (in MB)/(in sectors): 0/0
         Device ID                : IBM     YGLv3 S20   000
      Target on SCSI ID 9
         Device is a Processor device
         SCSI ID                  : 9
         PFA (Yes/No)             : No
         State                    : Standby (SBY)
         Size (in MB)/(in sectors): 0/0
         Device ID                : IBM     YGHv3 S20   000
      Target on SCSI ID 12
         Device is a Hard disk
         SCSI ID                  : 12
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8C909
         FRU part number          : 19K1469
      Target on SCSI ID 13
         Device is a Hard disk
         SCSI ID                  : 13
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8D063
         FRU part number          : 19K1469
      Target on SCSI ID 14
         Device is a Hard disk
         SCSI ID                  : 14
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8D034
         FRU part number          : 19K1469
   Channel #2:
      Initiator at SCSI ID 7
Command completed successfully.
----
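
As a cross-check that the controller and the kernel agree on the array's 
size, the figures above can be compared with a few lines of arithmetic. This 
is a sketch under two assumptions that the numbers themselves appear to bear 
out: ipssend's "MB" means 2**20 bytes, and dmesg's "MB" means 10**6 bytes. 
The helper names are invented for illustration.

```python
MIB = 1024 * 1024  # assumption: ipssend's "MB" means 2**20 bytes


def raid5_usable_mb(disk_mb, disks):
    """RAID 5 spends one disk's worth of space on parity,
    so usable space is (disks - 1) * disk size."""
    return (disks - 1) * disk_mb


def sectors_to_bytes(sectors, sector_size=512):
    """Convert a sector count to bytes."""
    return sectors * sector_size


# Values copied from the ipssend and dmesg output above.
usable_mb = raid5_usable_mb(disk_mb=34715, disks=6)
sda_bytes = sectors_to_bytes(355481600)

print(usable_mb)                     # compare with ipssend "Size (in MB)": 173575
print(usable_mb * MIB == sda_bytes)  # do controller and kernel sizes match exactly?
print(round(sda_bytes / 10**6))      # compare with dmesg "182007 MB"
```

Six 34715 MB disks in RAID 5 give 5 x 34715 = 173575 MB, which is the logical 
drive size ipssend reports and, converted to bytes, exactly the 355481600 
sectors the kernel sees for sda. So the size reporting at least is consistent.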
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
