Hi Ray,
Sorry, I lost the thread in my browser. I have looked at the figures, and
no, I do not see an excessive number of page faults. The vmstat output you
posted from a very quiet system does, however, point to an unusually high
number of interrupts.
If you still have this issue, try booting a uniprocessor kernel, or boot
with the 'noapic' option, and see whether the symptom persists or the
system behaves properly. I also note that the kernel you mentioned does not
appear to be the latest. So, I would up2date the system, see if the
problem goes away, and then boot the up-to-date kernel (2.6.9-34.ELsmp as
we speak) with the noapic option and see if that makes a difference.
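For reference, booting with 'noapic' just means appending it to the kernel
line in /etc/grub.conf; a hypothetical entry (the kernel version, root
device, and paths below are illustrative only, so adjust them to your own
grub.conf):

```
title Red Hat Enterprise Linux AS (2.6.9-34.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/md0 noapic
        initrd /initrd-2.6.9-34.ELsmp.img
```

You can also test it one-off by editing the kernel line from the GRUB boot
menu, which avoids committing the change until you know it helps.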
Best Regards,
GM
Ray Van Dolson wrote:
On Fri, Apr 07, 2006 at 12:17:30PM +0200, George Magklaras wrote:
Seeing init in S mode in 'top' like that:
1 root 16 0 1972 556 480 S 0.0 0.0 0:00.53 init
is not so extraordinary if you have just invoked 'top'. If it were in 'R'
or another state continuously, that would be alarming.
init stays in 'S' mode for the duration of top.
Another symptom that comes along with this weird non-0.00 load issue is
that user I/O seems to "glitch" every now and then. Almost like the hard
drives are spinning up after being put to sleep... however, APM is disabled
in my kernel since I am running in SMP mode.
I think that #might# be the key symptom. What exactly do you mean by the
'glitch'? Does I/O pause for an interval long enough that you notice it for
several seconds and then continue, or does it abort completely (I/O
errors)? It could be that some kind of background reconstruction or
syncing is happening due to driver or hardware issues.
Yes. This is exactly the behavior I'm experiencing. Everything just pauses,
then control returns within 2-5 seconds.
dmesg | grep -i md
should show any hiccups related to the RAID config. Running a
'vmstat 3' would also help.
Nothing interesting really in the dmesg output, but vmstat shows a lot of
interrupts:
On DL140G2 w/ SATA software RAID1:
[root@localhost oracle]# vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 89340 46236 1828624 0 0 1 29 185 17 0 0 98 1
0 0 0 89340 46236 1828624 0 0 0 11 1014 17 0 0 100 1
0 0 0 89276 46236 1828624 0 0 0 28 1017 25 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 100 1
0 0 0 89276 46236 1828624 0 0 0 21 1016 24 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 20 1016 24 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1013 19 0 0 99 1
On DL140G1 w/ IDE software RAID1 (this box is actually in production, so it
is "busier" than the box above):
[root@billmax root]# vmstat 3
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 24604 18508 127772 507604 0 0 1 0 1 0 0 0 0 1
0 0 24604 18508 127772 507604 0 0 0 0 113 24 0 0 100 0
0 0 24604 18508 127772 507604 0 0 0 16 115 29 0 0 100 0
0 0 24604 18508 127772 507604 0 0 0 111 164 52 0 0 96 3
0 0 24604 18508 127772 507604 0 0 0 0 121 33 0 0 100 0
0 0 24604 18508 127772 507608 0 0 1 7 116 48 0 0 100 0
0 0 24604 18508 127772 507620 0 0 3 0 113 38 0 0 100 0
0 0 24604 18508 127772 507620 0 0 0 51 131 26 0 0 100 0
0 0 24604 18508 127772 507620 0 0 0 0 113 34 0 0 100 0
/proc/interrupts, the output of 'lsmod', and your SoftRAID config files
would help, as well as your kernel version.
Kernel is 2.6.9-22.ELsmp.
[root@localhost oracle]# cat /proc/interrupts
CPU0 CPU1
0: 33071575 33118497 IO-APIC-edge timer
1: 28 58 IO-APIC-edge i8042
8: 0 1 IO-APIC-edge rtc
9: 0 0 IO-APIC-level acpi
14: 79946 81927 IO-APIC-edge libata
15: 81059 80767 IO-APIC-edge libata
169: 1037048 132 IO-APIC-level uhci_hcd, eth0
177: 0 0 IO-APIC-level uhci_hcd
185: 0 0 IO-APIC-level ehci_hcd
NMI: 0 0
LOC: 66192663 66192736
ERR: 0
MIS: 0
RAID configuration -- it doesn't appear that /etc/raidtab gets generated any
longer. Here is /etc/mdadm.conf:
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 super-minor=0
ARRAY /dev/md1 super-minor=1
Some output from dmesg:
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14
ata2: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x1478 irq 15
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md: adding sdb3 ...
md: sdb1 has different UUID to sdb3
md: adding sda3 ...
md: sda1 has different UUID to sdb3
md: created md0
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md0 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md1
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md1 active with 2 out of 2 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT3 FS on md0, internal journal
EXT3 FS on md1, internal journal
[root@localhost oracle]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[0]
76991424 blocks [2/2] [UU]
unused devices: <none>
Here are also some sar statistics:
[root@localhost oracle]# sar
12:00:01 AM CPU %user %nice %system %iowait %idle
08:00:01 AM all 0.00 0.00 0.01 0.92 99.06
08:10:01 AM all 0.15 0.00 0.02 0.98 98.85
08:20:01 AM all 0.01 0.00 0.01 0.95 99.03
08:30:01 AM all 0.01 0.00 0.01 0.95 99.03
08:40:01 AM all 0.00 0.00 0.01 1.05 98.94
08:50:01 AM all 0.01 0.00 0.01 0.95 99.03
09:00:01 AM all 0.02 0.00 0.02 0.95 99.00
09:10:01 AM all 0.16 0.00 0.03 0.98 98.83
09:20:01 AM all 0.01 0.00 0.01 0.96 99.02
Average: all 0.04 0.01 0.03 1.01 98.91
iowait seems noticeably higher than on my DL140G1.
[root@localhost oracle]# sar -B
Linux 2.6.9-22.ELsmp (localhost.localdomain) 04/07/2006
12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s
12:10:01 AM 0.07 19.70 45.95 0.00
12:20:01 AM 0.00 17.59 10.47 0.00
12:30:01 AM 0.00 17.10 9.02 0.00
12:40:01 AM 0.00 21.03 15.56 0.00
12:50:01 AM 0.00 17.34 15.80 0.00
01:00:01 AM 0.00 17.20 8.97 0.00
01:10:01 AM 0.00 19.50 45.04 0.00
01:20:01 AM 0.00 17.49 9.28 0.00
01:30:01 AM 0.00 17.22 8.94 0.00
01:40:01 AM 0.00 20.27 15.61 0.00
01:50:01 AM 0.00 17.08 9.10 0.00
Not sure if the number of page faults there is unusual or not.
The most unusual thing seems to be the number of interrupts going on. I
can't seem to call sar -I with an IRQ value of 0, but a watch -n 1 "cat
/proc/interrupts" seems to show about 1000 interrupts per second to the
IO-APIC-edge timer on the DL140G2 system.
On the DL140G1 system, I am only seeing about 100 interrupts per second to
the IO-APIC-edge timer.
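A rate like that can also be estimated without watch, by diffing two
snapshots of /proc/interrupts. A minimal sketch; the two snapshot lines and
the 3-second interval below are made-up sample data, not taken from either
box:

```shell
# Sketch: estimate the timer IRQ rate by diffing two snapshots of
# /proc/interrupts. SNAP1/SNAP2 are made-up sample lines; on a live
# system you would capture the real 'timer' line 3 seconds apart.
SNAP1='  0:   33071575   33118497    IO-APIC-edge  timer'
SNAP2='  0:   33074575   33121497    IO-APIC-edge  timer'
INTERVAL=3
# Fields 2 and 3 are the per-CPU counters (CPU0, CPU1): sum them,
# subtract the first snapshot from the second, divide by the interval.
rate=$(printf '%s\n%s\n' "$SNAP1" "$SNAP2" | awk -v t="$INTERVAL" '
  NR==1 { c1 = $2 + $3 }
  NR==2 { print (($2 + $3) - c1) / t }')
echo "timer IRQ rate: $rate interrupts/sec (both CPUs combined)"
```

Note this counts both CPUs together, so ~1000/sec per CPU shows up as
~2000/sec combined.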
Anyways, I am going to keep playing around with sar and see if anything else
stands out. Any suggestions?
Ray
--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list