Hello,

We have an IBM xSeries 330 server running Red Hat 7.2. The server works fine most of the time, but every now and then it just hangs. At first I thought it had something to do with SMP, but now it seems to be a RAID-related problem.

If I'm lucky, I can still use the console after the hang, provided I was already logged in. Whenever we hit this hang, all hard disk activity seems to have stopped. Usually I can still run commands that are already loaded in memory, but anything that has to touch the disk fails. Physically, the server keeps blinking its disk lights as usual even after the hang.

Red Hat's own kernel (2.4.9-13) crashes the server after 1-12 hours of uptime. With our home-built 2.4.17 we have reached as much as 40 days of uptime, but after that it hung again.

Is this a known issue? Is there a known bug lurking somewhere in the SCSI/RAID/other disk-related code?

Some information about the server:

- Red Hat 7.2
- Linux kernel 2.4.17
- IBM ServeRAID (version 4.80.26)
- 1 GB RAM
- 2 x Pentium III Xeon

And a whole lot more information:

lspci output:
---
00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (rev 21)
00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
00:00.2 Host bridge: ServerWorks: Unknown device 0006
00:00.3 Host bridge: ServerWorks: Unknown device 0006
00:05.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet LANCE] (rev 44)
00:06.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 4f)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
02:01.0 SCSI storage controller: Adaptec 7899P (rev 01)
02:01.1 SCSI storage controller: Adaptec 7899P (rev 01)
05:02.0 RAID bus controller: IBM Netfinity ServeRAID controller
---

dmesg output:
---
Linux version 2.4.17 (root@mbnet.mbnet.fi) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #3 SMP Mon Jan 21 18:55:55 EET 2002
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009d000 (usable)
 BIOS-e820: 000000000009d000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff9380 (usable)
 BIOS-e820: 000000003fff9380 - 0000000040000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
found SMP MP-table at 0009e1d0
hm, page 0009e000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 0009e000 reserved twice.
hm, page 0009f000 reserved twice.
WARNING: MP table in the EBDA can be UNSAFE, contact linux-smp@vger.kernel.org if you experience SMP problems!
On node 0 totalpages: 262137
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32761 pages.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: IBM ENSW Product ID: NF 6000R SMP APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 17
Processor #0 Pentium(tm) Pro APIC version 17
I/O APIC #14 Version 17 at 0xFEC00000.
I/O APIC #13 Version 17 at 0xFEC01000.
Processors: 2
Kernel command line: auto BOOT_IMAGE=linuxjaba2417 ro root=808 BOOT_FILE=/boot/vmlinuz-jaba-2417
Initializing CPU#0
Detected 701.803 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1399.19 BogoMIPS
Memory: 1029052k/1048548k available (1573k kernel code, 19100k reserved, 449k data, 220k init, 131044k highmem)
Dentry-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Mount-cache hash table entries: 16384 (order: 5, 131072 bytes)
Buffer-cache hash table entries: 65536 (order: 6, 262144 bytes)
Page-cache hash table entries: 262144 (order: 8, 1048576 bytes)
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch@atnf.csiro.au)
mtrr: detected mtrr type: Intel
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel Pentium III (Cascades) stepping 01
per-CPU timeslice cutoff: 2927.55 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Booting processor 1/0 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1402.47 BogoMIPS
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
Intel machine check reporting enabled on CPU#1.
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU1: Intel Pentium III (Cascades) stepping 01
Total of 2 processors activated (2801.66 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 14 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 14 ... ok.
Setting 13 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 13 ... ok.
init IO_APIC IRQs
 IO-APIC (apicid-pin) 14-0, 13-10, 13-11, 13-12, 13-13, 13-14, 13-15 not connected.
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ... failed.
...trying to set up timer as Virtual Wire IRQ... works.
number of MP IRQ sources: 39.
number of IO-APIC #14 registers: 16.
number of IO-APIC #13 registers: 16.
testing the IO APIC.......................

IO APIC #14......
.... register #00: 0E000000
....... : physical APIC id: 0E
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 0E000000
....... : arbitration: 0E
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
 00 000 00 1 0 0 0 0 0 0 00
 01 003 03 0 0 0 0 0 1 1 39
 02 000 00 1 0 0 0 0 0 0 00
 03 003 03 0 0 0 0 0 1 1 41
 04 003 03 0 0 0 0 0 1 1 49
 05 003 03 1 1 0 1 0 1 1 51
 06 003 03 0 0 0 0 0 1 1 59
 07 003 03 0 0 0 0 0 1 1 61
 08 003 03 0 0 0 0 0 1 1 69
 09 003 03 1 1 0 1 0 1 1 71
 0a 003 03 1 1 0 1 0 1 1 79
 0b 003 03 1 1 0 1 0 1 1 81
 0c 003 03 0 0 0 0 0 1 1 89
 0d 003 03 0 0 0 0 0 1 1 91
 0e 003 03 0 0 0 0 0 1 1 99
 0f 003 03 1 1 0 1 0 1 1 A1

IO APIC #13......
.... register #00: 0D000000
....... : physical APIC id: 0D
.... register #01: 000F0011
....... : max redirection entries: 000F
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 0D000000
....... : arbitration: 0D
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
 00 003 03 1 1 0 1 0 1 1 A9
 01 003 03 1 1 0 1 0 1 1 B1
 02 003 03 1 1 0 1 0 1 1 B9
 03 003 03 1 1 0 1 0 1 1 C1
 04 003 03 1 1 0 1 0 1 1 C9
 05 003 03 1 1 0 1 0 1 1 D1
 06 003 03 1 1 0 1 0 1 1 D9
 07 003 03 1 1 0 1 0 1 1 E1
 08 003 03 1 1 0 1 0 1 1 E9
 09 003 03 1 1 0 1 0 1 1 32
 0a 000 00 1 0 0 0 0 0 0 00
 0b 000 00 1 0 0 0 0 0 0 00
 0c 000 00 1 0 0 0 0 0 0 00
 0d 000 00 1 0 0 0 0 0 0 00
 0e 000 00 1 0 0 0 0 0 0 00
 0f 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 1:0
IRQ17 -> 1:1
IRQ18 -> 1:2
IRQ19 -> 1:3
IRQ20 -> 1:4
IRQ21 -> 1:5
IRQ22 -> 1:6
IRQ23 -> 1:7
IRQ24 -> 1:8
IRQ25 -> 1:9
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 701.7336 MHz.
..... host bus clock speed is 100.2475 MHz.
cpu: 0, clocks: 1002475, slice: 334158
CPU0<T0:1002464,T1:668304,D:2,S:334158,C:1002475>
cpu: 1, clocks: 1002475, slice: 334158
CPU1<T0:1002464,T1:334144,D:4,S:334158,C:1002475>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfd32c, last bus=8
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered peer bus 02
PCI: Discovered peer bus 05
PCI->APIC IRQ transform: (B0,I5,P0) -> 16
PCI->APIC IRQ transform: (B0,I15,P0) -> 19
PCI->APIC IRQ transform: (B2,I1,P0) -> 17
PCI->APIC IRQ transform: (B2,I1,P1) -> 18
PCI->APIC IRQ transform: (B5,I2,P0) -> 21
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
IBM machine detected. Enabling interrupts during APM calls.
Starting kswapd
allocated 32 pages and 32 bhs reserved for the highmem bounces
VFS: Diskquotas version dquot_6.4.0 initialized
Journalled Block Device driver loaded
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10e
block: 128 slots per queue, batch=32
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x0700-0x0707, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0x0708-0x070f, BIOS settings: hdc:pio, hdd:pio
hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 48X CD-ROM drive, 128kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
loop: loaded (max 8 devices)
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
<Adaptec aic7899 Ultra160 SCSI adapter>
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
scsi2 : IBM PCI ServeRAID 4.80.26 <ServeRAID 4M>
Vendor: IBM Model: SERVERAID Rev: 1.00
Type: Direct-Access ANSI SCSI revision: 02
Vendor: IBM Model: SERVERAID Rev: 1.00
Type: Processor ANSI SCSI revision: 02
Vendor: IBM Model: YGLv3 S2 Rev: 0
Type: Processor ANSI SCSI revision: 02
Vendor: IBM Model: YGHv3 S2 Rev: 0
Type: Processor ANSI SCSI revision: 02
Attached scsi disk sda at scsi2, channel 0, id 0, lun 0
SCSI device sda: 355481600 512-byte hdwr sectors (182007 MB)
Partition check:
 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 >
Attached scsi generic sg1 at scsi2, channel 0, id 15, lun 0, type 3
Attached scsi generic sg2 at scsi2, channel 1, id 8, lun 0, type 3
Attached scsi generic sg3 at scsi2, channel 1, id 9, lun 0, type 3
Linux Kernel Card Services 3.1.22
 options: [pci] [cardbus]
md: raid5 personality registered as nr 4
raid5: measuring checksumming speed
 8regs : 1294.400 MB/sec
 32regs : 631.200 MB/sec
 pIII_sse : 1436.400 MB/sec
 pII_mmx : 1578.000 MB/sec
 p5_mmx : 1643.600 MB/sec
raid5: using function: pIII_sse (1436.400 MB/sec)
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
pci_hotplug: PCI Hot Plug PCI Core version: 0.3
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 8192 buckets, 64Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
ds: no socket drivers loaded!
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 220k freed
Adding Swap: 2096220k swap-space (priority -1)
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,6), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
pcnet32_probe_pci: found device 0x001022.0x002000
 ioaddr=0x002200 resource_flags=0x000101
eth0: PCnet/FAST III 79C975 at 0x2200, 00 02 55 fc 1a 6e
pcnet32: pcnet32_private lp=f6757000 lp_dma_addr=0x36757000 assigned IRQ 16.
pcnet32.c:v1.25kf 26.9.1999 tsbogend@alpha.franken.de
sending pkt_too_big to self
---

ipssend output:
----
Found 1 IBM ServeRAID controller(s).
Read configuration has been initiated for controller 1...
-------------------------------------------------------------------------------
Controller information
-------------------------------------------------------------------------------
   Controller type                : ServeRAID-4M
   BIOS version                   : 4.80.26
   Firmware version               : 4.80.26
   Boot block version             : 4.70.17
   Device driver version          : 4.80.26
   Controller slot information    : 2
   Controller Name                : Null Config
   SCSI channel description       : 2 parallel SCSI wide
   Initiator IDs (Channel/SCSI ID): 1/7 2/7
   Maximum physical devices       : 30
   Defunct disk drive count       : 0
   Logical drives/Offline/Critical: 1/0/0
   Read ahead                     : Adaptive
   Stripe-unit size               : 8 KB
   Rebuild rate (Low/Medium/High) : High
   Hot-swap rebuild               : Enabled
   Data scrubbing                 : Enabled
   Part of cluster (Yes/No)       : No
   Unattended mode (Yes/No)       : No
   Concurrent commands supported  : 96
   Configuration update count     : 25
-------------------------------------------------------------------------------
Logical drive information
-------------------------------------------------------------------------------
   Logical drive number 1
      Status of logical drive     : Okay (OKY)
      RAID level                  : 5
      Size (in MB)                : 173575
      Write cache status          : Write back (WB)
      Number of chunks            : 6
      Stripe-unit size            : 8 KB
      Access blocked              : No
      Part of array               : A
      Part of merge group         : 207
   Array A stripe order (Channel/SCSI ID) : 1,0 1,1 1,2 1,12 1,13 1,14
-------------------------------------------------------------------------------
Physical device information
-------------------------------------------------------------------------------
   Channel #1:
      Initiator at SCSI ID 7
      Target on SCSI ID 0
         Device is a Hard disk
         SCSI ID                  : 0
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY83763
         FRU part number          : 19K1469
      Target on SCSI ID 1
         Device is a Hard disk
         SCSI ID                  : 1
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8C929
         FRU part number          : 19K1469
      Target on SCSI ID 2
         Device is a Hard disk
         SCSI ID                  : 2
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY88233
         FRU part number          : 19K1469
      Target on SCSI ID 8
         Device is a Processor device
         SCSI ID                  : 8
         PFA (Yes/No)             : No
         State                    : Standby (SBY)
         Size (in MB)/(in sectors): 0/0
         Device ID                : IBM YGLv3 S20 000
      Target on SCSI ID 9
         Device is a Processor device
         SCSI ID                  : 9
         PFA (Yes/No)             : No
         State                    : Standby (SBY)
         Size (in MB)/(in sectors): 0/0
         Device ID                : IBM YGHv3 S20 000
      Target on SCSI ID 12
         Device is a Hard disk
         SCSI ID                  : 12
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8C909
         FRU part number          : 19K1469
      Target on SCSI ID 13
         Device is a Hard disk
         SCSI ID                  : 13
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8D063
         FRU part number          : 19K1469
      Target on SCSI ID 14
         Device is a Hard disk
         SCSI ID                  : 14
         PFA (Yes/No)             : No
         State                    : Online (ONL)
         Size (in MB)/(in sectors): 34715/71096368
         Device ID                : IBM-PSG DDYS-T36S9HA4FY8D034
         FRU part number          : 19K1469
   Channel #2:
      Initiator at SCSI ID 7
Command completed successfully.
----

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
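
P.S. As a sanity check on the numbers above, the sizes reported by ipssend and dmesg appear to be consistent with each other, so the kernel and the controller seem to agree on the geometry of the logical drive. Below is a minimal sketch of that arithmetic in Python. The function names are made up for illustration, and it assumes that ipssend reports sizes in units of 2^20 bytes and that RAID 5 gives up one member's worth of capacity to parity. Neither assumption comes from IBM documentation; they are only what the figures suggest.

```python
SECTOR_BYTES = 512  # dmesg reports "512-byte hdwr sectors"


def raid5_usable_mb(member_mb, members):
    """Usable size of a RAID 5 array: one member's worth of space holds parity."""
    if members < 3:
        raise ValueError("RAID 5 needs at least three members")
    return member_mb * (members - 1)


def sectors_to_mib(sectors):
    """Convert a sector count to units of 2**20 bytes (assumed ipssend unit)."""
    return sectors * SECTOR_BYTES / 2**20


def sectors_to_decimal_mb(sectors):
    """Convert a sector count to units of 10**6 bytes."""
    return sectors * SECTOR_BYTES / 10**6


if __name__ == "__main__":
    # Six members of 34715 MB each ("Number of chunks : 6" in ipssend).
    print("RAID 5 usable MB:", raid5_usable_mb(34715, 6))
    # sda as seen by the kernel: 355481600 sectors.
    print("sda in MiB:", sectors_to_mib(355481600))
    print("sda in decimal MB:", round(sectors_to_decimal_mb(355481600)))
    # One physical disk: 71096368 sectors.
    print("one disk in MiB:", int(sectors_to_mib(71096368)))
```

With the figures from this post, the first two results both come out at 173575, which matches the "Size (in MB)" of logical drive 1, and the decimal figure rounds to 182007, the value dmesg prints for sda. This says nothing about the cause of the hang; it only rules out an obvious size mismatch between the driver and the controller configuration.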