Hi Ray,
Sorry, I lost the thread in my browser. I have looked at the figures, and
no, I do not see an excessive number of page faults. The vmstat output you
posted from a very quiet system does, however, point to an unusually high
number of interrupts.
If you still have this issue, try booting a uniprocessor kernel, or boot
with the 'noapic' option, and see whether the symptom persists or the
system behaves properly. I also note that the kernel you mentioned does not
appear to be the latest. So, I would up2date the system, see if the
problem goes away, and then boot the up-to-date kernel (2.6.9-34.ELsmp as
we speak) with the noapic option and see if that makes a difference.
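For reference, booting with 'noapic' just means appending it to the kernel
line in /etc/grub.conf; a hypothetical entry (the kernel version, root
device, and paths below are illustrative only, so adjust them to your own
grub.conf):

```
title Red Hat Enterprise Linux AS (2.6.9-34.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/md0 noapic
        initrd /initrd-2.6.9-34.ELsmp.img
```

You can also test it one-off by editing the kernel line from the GRUB boot
menu, which avoids committing the change until you know it helps.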
Best Regards,
GM
Ray Van Dolson wrote:
On Fri, Apr 07, 2006 at 12:17:30PM +0200, George Magklaras wrote:
Seeing init in S mode in 'top' like that:
1 root 16 0 1972 556 480 S 0.0 0.0 0:00.53 init
is not so extraordinary if you have just invoked 'top'. If it were in 'R'
or another state continuously, that would be alarming.
init stays in 'S' mode for the duration of top.
Another symptom that comes along with this weird non-0.00 load issue is
that user I/O seems to "glitch" every now and then. Almost like the hard
drives are spinning up after being put to sleep... however, APM is disabled
in my kernel since I am running in SMP mode.
I think that #might# be the key symptom. What exactly do you mean by the
'glitch'? Does I/O pause for an interval long enough that you notice it for
several seconds and then continue, or does it abort completely (I/O
errors)? It could be that some kind of background reconstruction or
syncing is happening due to driver or hardware issues.
Yes. This is exactly the behavior I'm experiencing. Everything just pauses,
then control returns within 2-5 seconds.
dmesg | grep -i md
should show any hiccups related to the RAID config. Running a
'vmstat 3' would also help.
Nothing interesting really in the dmesg output, but vmstat shows a lot of
interrupts:
On DL140G2 w/ SATA software RAID1:
[root@localhost oracle]# vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 89340 46236 1828624 0 0 1 29 185 17 0 0 98 1
0 0 0 89340 46236 1828624 0 0 0 11 1014 17 0 0 100 1
0 0 0 89276 46236 1828624 0 0 0 28 1017 25 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 100 1
0 0 0 89276 46236 1828624 0 0 0 21 1016 24 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1014 19 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 20 1016 24 0 0 99 1
0 0 0 89276 46236 1828624 0 0 0 11 1013 19 0 0 99 1
On DL140G1 w/ IDE software RAID1 (this box is actually in production, so it
is "busier" than the box above):
[root@billmax root]# vmstat 3
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 24604 18508 127772 507604 0 0 1 0 1 0 0 0 0 1
0 0 24604 18508 127772 507604 0 0 0 0 113 24 0 0 100 0
0 0 24604 18508 127772 507604 0 0 0 16 115 29 0 0 100 0
0 0 24604 18508 127772 507604 0 0 0 111 164 52 0 0 96 3
0 0 24604 18508 127772 507604 0 0 0 0 121 33 0 0 100 0
0 0 24604 18508 127772 507608 0 0 1 7 116 48 0 0 100 0
0 0 24604 18508 127772 507620 0 0 3 0 113 38 0 0 100 0
0 0 24604 18508 127772 507620 0 0 0 51 131 26 0 0 100 0
0 0 24604 18508 127772 507620 0 0 0 0 113 34 0 0 100 0
/proc/interrupts, the output of 'lsmod', and your SoftRAID config files
would help, as well as your kernel version.
Kernel is 2.6.9-22.ELsmp.
[root@localhost oracle]# cat /proc/interrupts
CPU0 CPU1
0: 33071575 33118497 IO-APIC-edge timer
1: 28 58 IO-APIC-edge i8042
8: 0 1 IO-APIC-edge rtc
9: 0 0 IO-APIC-level acpi
14: 79946 81927 IO-APIC-edge libata
15: 81059 80767 IO-APIC-edge libata
169: 1037048 132 IO-APIC-level uhci_hcd, eth0
177: 0 0 IO-APIC-level uhci_hcd
185: 0 0 IO-APIC-level ehci_hcd
NMI: 0 0
LOC: 66192663 66192736
ERR: 0
MIS: 0
RAID configuration -- it doesn't appear that /etc/raidtab gets generated any
longer. Here is /etc/mdadm.conf:
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 super-minor=0
ARRAY /dev/md1 super-minor=1
Some output from dmesg:
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x1470 irq 14
ata2: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x1478 irq 15
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md: adding sdb3 ...
md: sdb1 has different UUID to sdb3
md: adding sda3 ...
md: sda1 has different UUID to sdb3
md: created md0
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md0 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md1
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md1 active with 2 out of 2 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT3 FS on md0, internal journal
EXT3 FS on md1, internal journal
[root@localhost oracle]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[0]
76991424 blocks [2/2] [UU]
unused devices: <none>
Here are also some sar statistics:
[root@localhost oracle]# sar
12:00:01 AM CPU %user %nice %system %iowait %idle
08:00:01 AM all 0.00 0.00 0.01 0.92 99.06
08:10:01 AM all 0.15 0.00 0.02 0.98 98.85
08:20:01 AM all 0.01 0.00 0.01 0.95 99.03
08:30:01 AM all 0.01 0.00 0.01 0.95 99.03
08:40:01 AM all 0.00 0.00 0.01 1.05 98.94
08:50:01 AM all 0.01 0.00 0.01 0.95 99.03
09:00:01 AM all 0.02 0.00 0.02 0.95 99.00
09:10:01 AM all 0.16 0.00 0.03 0.98 98.83
09:20:01 AM all 0.01 0.00 0.01 0.96 99.02
Average: all 0.04 0.01 0.03 1.01 98.91
iowait seems noticeably higher than on my DL140G1.
[root@localhost oracle]# sar -B
Linux 2.6.9-22.ELsmp (localhost.localdomain) 04/07/2006
12:00:01 AM pgpgin/s pgpgout/s fault/s majflt/s
12:10:01 AM 0.07 19.70 45.95 0.00
12:20:01 AM 0.00 17.59 10.47 0.00
12:30:01 AM 0.00 17.10 9.02 0.00
12:40:01 AM 0.00 21.03 15.56 0.00
12:50:01 AM 0.00 17.34 15.80 0.00
01:00:01 AM 0.00 17.20 8.97 0.00
01:10:01 AM 0.00 19.50 45.04 0.00
01:20:01 AM 0.00 17.49 9.28 0.00
01:30:01 AM 0.00 17.22 8.94 0.00
01:40:01 AM 0.00 20.27 15.61 0.00
01:50:01 AM 0.00 17.08 9.10 0.00
Not sure if the number of page faults there is unusual or not.
The most unusual thing seems to be the number of interrupts going on. I
can't seem to call sar -I with an IRQ value of 0, but a watch -n 1 "cat
/proc/interrupts" seems to show about 1000 interrupts per second to the
IO-APIC-edge timer on the DL140G2 system.
On the DL140G1 system, I am only seeing about 100 interrupts per second to
the IO-APIC-edge timer.
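A rate like that can also be estimated without watch, by diffing two
snapshots of /proc/interrupts. A minimal sketch; the two snapshot lines and
the 3-second interval below are made-up sample data, not taken from either
box:

```shell
# Sketch: estimate the timer IRQ rate by diffing two snapshots of
# /proc/interrupts. SNAP1/SNAP2 are made-up sample lines; on a live
# system you would capture the real 'timer' line 3 seconds apart.
SNAP1='  0:   33071575   33118497    IO-APIC-edge  timer'
SNAP2='  0:   33074575   33121497    IO-APIC-edge  timer'
INTERVAL=3
# Fields 2 and 3 are the per-CPU counters (CPU0, CPU1): sum them,
# subtract the first snapshot from the second, divide by the interval.
rate=$(printf '%s\n%s\n' "$SNAP1" "$SNAP2" | awk -v t="$INTERVAL" '
  NR==1 { c1 = $2 + $3 }
  NR==2 { print (($2 + $3) - c1) / t }')
echo "timer IRQ rate: $rate interrupts/sec (both CPUs combined)"
```

Note this counts both CPUs together, so ~1000/sec per CPU shows up as
~2000/sec combined.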
Anyways, I am going to keep playing around with sar and see if anything else
stands out. Any suggestions?
Ray
--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list