RE: EXTERNAL:Re: RHELv4 and v5 - So slow as to be unusable.

"Burke, Thomas G." <tg.burke@xxxxxxx> · Wed, 13 Oct 2010 09:44:58 -0500

Interesting.  You mentioned file IO maybe being related?  Is this a custom kernel?  Is it possible you're using a circuitous IO route to get to disk (for example, IDE-SCSI emulation or something similar)?
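One quick way to check the disk path: sar -d labels block devices by major/minor number (dev3-0, dev22-64, and so on), and /proc/partitions maps those numbers to kernel device names. Under the standard Linux numbering, major 3 is the first IDE interface (hda, hdb) and major 8 is SCSI disk (sda and up). Below is a minimal Python 3 sketch, assuming a 2.6-era /proc/partitions whose first four columns are major, minor, #blocks and name; the function name sar_labels is hypothetical, made up for illustration.

```python
# Sketch: map sar's "devMAJOR-MINOR" device labels to kernel device names
# using /proc/partitions (columns: major minor #blocks name [...]).
from pathlib import Path


def sar_labels(partitions_text):
    """Return {"dev<major>-<minor>": name} for each data row of /proc/partitions."""
    labels = {}
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows start with numeric major and minor numbers; the header
        # ("major minor #blocks name") and blank lines are skipped. Extra
        # trailing columns (as some older kernels add) are ignored.
        if len(fields) >= 4 and fields[0].isdigit() and fields[1].isdigit():
            major, minor, _blocks, name = fields[:4]
            labels[f"dev{major}-{minor}"] = name
    return labels


if __name__ == "__main__":
    text = Path("/proc/partitions").read_text()
    for label, name in sorted(sar_labels(text).items()):
        print(label, name)
```

Running it on an affected machine and comparing the output with the sar -d device report shows which driver path the busy device is on.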

-----Original Message-----
From: redhat-list-bounces@xxxxxxxxxx [mailto:redhat-list-bounces@xxxxxxxxxx] On Behalf Of Mohammad Zakaria
Sent: Wednesday, October 13, 2010 8:44 AM
To: General Red Hat Linux discussion list
Subject: EXTERNAL:Re: RHELv4 and v5 - So slow as to be unusable.

Hello Grey,

Are those machines all the same brand or hardware?
If you resync with your NTP server, does the clock start keeping correct time, or does it continue at the same 7.8x slow rate?
Try resetting one of the machines' BIOS to its defaults and check the results.
Have you tried disabling one of the processor cores and running on a single processor?

--- On Sat, 10/9/10, Mohammad Zakaria <myz_sa@xxxxxxxxx> wrote:

From: Mohammad Zakaria <myz_sa@xxxxxxxxx>
Subject: Re: RHELv4 and v5 - So slow as to be unusable.
To: "General Red Hat Linux discussion list" <redhat-list@xxxxxxxxxx>
Date: Saturday, October 9, 2010, 12:58 PM

If you have a single RAM module, try replacing it and checking the box's behavior; if you 
have two sets, test system performance with each set separately. If there is any problem 
with your RAM hardware, you should be able to detect and fix it easily.

________________________________
From: Gary E Barnes <gebarnes@xxxxxxxxxx>
To: redhat-list@xxxxxxxxxx
Sent: Thu, October 7, 2010 9:17:01 PM
Subject: Re: RHELv4 and v5 - So slow as to be unusable.

> From: Laszlo Beres <laszlo@xxxxxxxx>
> Subject: Re: RHELv4 and v5 - So slow as to be unusable.
> 
> On Wed, Oct 6, 2010 at 9:22 PM, Gary E Barnes <gebarnes@xxxxxxxxxx> wrote:
> 
> > "top" says that nothing is going on although the load average is 3+.
> > "sar" also says that nothing is going on.
> 
> There's no such thing as "nothing is going on". You should see CPU
> status, process status, etc. vmstat can also give you some hints about
> system health.

Oh, but there is such a thing.  I have one of the machines in this weird 
slowdown state right now.  It started around 4:45pm yesterday, after 
running perfectly for about 3 hours 15 minutes, and I left it overnight to 
see whether it would get "over it" by itself.  It hasn't, though.

Here is the very first header from the "top" display of a top I started 
just for this example.

top - 19:18:20 up  4:33,  4 users,  load average: 3.56, 3.58, 3.54
Tasks: 159 total,  16 running, 143 sleeping,  0 stopped,  0 zombie
Cpu(s):  1.3% us,  0.4% sy,  2.9% ni, 95.1% id,  0.3% wa,  0.0% hi,  0.0% si
Mem:  2586400k total,  1880032k used,  706368k free,  193036k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220324k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5813 root      16   0  164m  31m 6124 R  0.2  1.3  25:05.54 X
 6650 geb       16   0  4048 2116 1332 S  0.2  0.1   0:00.86 xalarm
27229 geb2      16   0  3008  960  696 R  0.2  0.0   0:00.01 top
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init

And here is the first refresh of that display (I'm capturing this in an 
Emacs buffer if you're curious).

Tasks: 161 total,  3 running, 158 sleeping,  0 stopped,  0 zombie
Cpu(s):  0.9% us,  0.5% sy,  0.1% ni, 98.5% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  2586400k total,  1885680k used,  700720k free,  193036k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220584k cached

Here is the second; notice the 100% idle value.

Tasks: 161 total,  2 running, 159 sleeping,  0 stopped,  0 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  2586400k total,  1885696k used,  700704k free,  193044k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220576k cached

There is memory available.  There is swap available.
Idle occasionally drops to 99.9%.

Tasks: 161 total,  2 running, 159 sleeping,  0 stopped,  0 zombie
Cpu(s):  0.0% us,  0.1% sy,  0.0% ni, 99.8% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  2586400k total,  1885464k used,  700936k free,  193076k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220544k cached

The processes that show up in the first line or two of top are things such 
as:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27231 geb       16   0 28316  11m 8412 S  9.8  0.5   0:00.41 gnome-terminal
 5813 root      16   0  165m  31m 6316 S  3.8  1.3  25:05.70 X
 6697 geb       16   0 22280  10m 7644 S  1.9  0.4   0:18.69 wnck-applet
 3817 rpc       15   0  2336  592  484 S  0.2  0.0   0:01.60 portmap
 6472 geb       16   0  3544 1472  876 S  0.2  0.1   0:02.08 gam_server
 6650 geb       16   0  4048 2116 1332 S  0.2  0.1   0:00.87 xalarm
 3805 root      16   0  2492  312  220 S  0.2  0.0   0:00.55 irqbalance
 4867 root      16   0  2800  844  624 D  0.2  0.0   0:03.32 rpc.mountd
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init
 6451 geb       16   0 12764 7512 1688 S  0.1  0.3   0:01.66 gconfd-2
27229 geb2      16   0  3012 1044  772 R  0.1  0.0   0:00.05 top
27229 geb2      16   0  3012 1044  772 R  0.5  0.0   0:00.07 top
 4996 root      16   0  4772 3104 1536 S  0.2  0.1   0:02.47 hald
 6472 geb       16   0  3544 1472  876 S  0.2  0.1   0:02.09 gam_server
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init
27229 geb2      16   0  3012 1044  772 R  0.5  0.0   0:00.09 top
23574 geb       16   0  145m  68m  26m S  0.2  2.7   2:44.86 firefox-bin
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init

As you can see, there is essentially "nothing going on".
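Worth keeping in mind when reading these numbers: the Linux load average counts tasks that are runnable (state R) plus tasks in uninterruptible sleep (state D, typically waiting on I/O), not CPU usage. So a load of 3+ next to ~99% idle means tasks are queued or blocked rather than computing (note rpc.mountd in state D in the listing above). A minimal Python 3 sketch that lists such processes from /proc/<pid>/stat; the helper names parse_stat and busy_tasks are hypothetical, and it only scans processes, not the individual threads under /proc/<pid>/task.

```python
# Sketch: list processes whose state is R (runnable) or D (uninterruptible
# sleep), the two states the Linux load average counts.
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    left, _, right = stat_line.rpartition(")")
    pid_str, _, comm = left.partition(" (")
    state = right.split()[0]
    return int(pid_str), comm, state


def busy_tasks(proc="/proc"):
    """Yield (pid, comm, state) for each process currently in state R or D."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state in ("R", "D"):
            yield pid, comm, state


if __name__ == "__main__":
    for pid, comm, state in busy_tasks():
        print(f"{pid:>6} {state} {comm}")
```

Running it a few times in a row shows whether the same tasks stay stuck in D, which would point at the I/O path rather than the CPU.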

And yet the machine is very unresponsive.  If I run a command that hasn't 
been run in a while (I don't know the exact time frame, but it seems to be 
only minutes), the command takes >30 seconds to execute.  For example, I 
just ran the "date" command and, when it finally responded, ran the hwclock 
command.  Both took >30 seconds.  If I repeat those commands now, they 
execute immediately.  I presume this is because the operating system has 
cached the executables.  If I wait a while, the >30 second delay reappears 
for those same commands; presumably they've dropped out of the cache.
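To put numbers on the cold-versus-warm difference, one could time the same command twice back to back. This is a minimal Python 3 sketch under the assumption that the first run may have to read the executable from disk while the immediate rerun is served from cache; time_command is a hypothetical helper. Keep in mind that it measures with the machine's own clock, so if the system clock is running slow the reported times will understate the real delay; a stopwatch or a second machine is a useful cross-check.

```python
# Sketch: time one command twice in a row to compare a possibly "cold" first
# run (executable not in the page cache) with an immediate "warm" rerun.
# Uses the local monotonic clock, which is only as accurate as the kernel's
# own timekeeping on this machine.
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return the elapsed seconds."""
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    cmd = sys.argv[1:] or ["date"]
    cold = time_command(cmd)
    warm = time_command(cmd)
    print(f"first run: {cold:.2f}s, immediate rerun: {warm:.2f}s")
```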

This behavior is observable both in xterms on the console and through ssh 
connections from another machine.

Programs that are already loaded and running seem to be pretty much OK, at 
least until they need to read or write a new file; then they hang for a 
while and eventually get going again.

If I run sar (from the sysstat package) I get essentially the same picture.  
Here are the averages from "sar -A 30 4" (four 30-second samples, i.e. two 
minutes): load average 3+ and over 99% idle.  There is nearly no I/O of any 
sort; not zero, but very low amounts over the two minutes.

Average:      proc/s
Average:        0.12

Average:      cswch/s
Average:        48.90

Average:          CPU    %user    %nice  %system  %iowait    %idle
Average:          all      0.29      0.01      0.12      0.07    99.51
Average:            0      0.25      0.01      0.13      0.00    99.60
Average:            1      0.33      0.01      0.11      0.13    99.42

Average:        INTR    intr/s
Average:          sum    11.87

Average:    pgpgin/s pgpgout/s  fault/s  majflt/s
Average:        0.01      8.55    34.76      0.00

Average:    pswpin/s pswpout/s
Average:        0.00      0.00

Average:          tps      rtps      wtps  bread/s  bwrtn/s
Average:        1.35      0.00      1.35      0.02    17.11

Average:      frmpg/s  bufpg/s  campg/s
Average:        -0.73      0.02    -0.02

Average:    CPU  i000/s  i001/s  i008/s  i009/s  i012/s  i014/s  i015/s  i177/s  i185/s  i193/s  i201/s  i209/s
Average:      0    0.02    0.00    0.00    0.00    0.00    0.00    0.09    7.76    0.00    0.00    0.00    0.00
Average:      1    0.01    2.65    0.00    0.00    0.00    1.35    0.00    0.00    0.00    0.00    0.00    0.00

Average:        IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo      0.00      0.00      0.12      0.12      0.00      0.00      0.00
Average:         eth0      3.35      2.25    364.83    196.51      0.00      0.00      0.00
Average:         sit0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         sit0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:      DEV              tps    rd_sec/s  wr_sec/s
Average:      dev1-0          0.00      0.00      0.00
Average:      dev1-1          0.00      0.00      0.00
Average:      dev1-2          0.00      0.00      0.00
Average:      dev1-3          0.00      0.00      0.00
Average:      dev1-4          0.00      0.00      0.00
Average:      dev1-5          0.00      0.00      0.00
Average:      dev1-6          0.00      0.00      0.00
Average:      dev1-7          0.00      0.00      0.00
Average:      dev1-8          0.00      0.00      0.00
Average:      dev1-9          0.00      0.00      0.00
Average:      dev1-10         0.00      0.00      0.00
Average:      dev1-11         0.00      0.00      0.00
Average:      dev1-12         0.00      0.00      0.00
Average:      dev1-13         0.00      0.00      0.00
Average:      dev1-14         0.00      0.00      0.00
Average:      dev1-15         0.00      0.00      0.00
Average:      dev3-0          1.35      0.02    17.11
Average:      dev3-1          0.00      0.00      0.00
Average:      dev3-2          0.13      0.02      3.35
Average:      dev3-3          0.00      0.00      0.00
Average:      dev3-4          0.00      0.00      0.00
Average:      dev3-5          1.22      0.01    13.75
Average:      dev22-64        0.00      0.00      0.00
Average:      dev22-65        0.00      0.00      0.00
Average:      dev22-0          0.00      0.00      0.00
Average:      dev2-0          0.00      0.00      0.00
Average:      dev9-0          0.00      0.00      0.00

Average:   kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
Average:      696608   1889792     73.07    193202   1220938   4192956         0      0.00         0

Average:   dentunusd   file-sz  inode-sz  super-sz %super-sz  dquot-sz %dquot-sz  rtsig-sz %rtsig-sz
Average:      216815      3285    185128         0      0.00         0      0.00         0      0.00

Average:      totsck    tcpsck    udpsck    rawsck  ip-frag
Average:          343        56        8        0        0

Average:      runq-sz  plist-sz  ldavg-1  ldavg-5  ldavg-15
Average:          14      186      3.35      3.61      3.59
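For context on the %idle figures above: top and sar derive them from the cumulative per-CPU tick counters in /proc/stat, sampled twice, with %idle being the share of elapsed ticks spent in the idle field. A minimal Python 3 sketch of that arithmetic follows; read_cpu_ticks and idle_percent are hypothetical names, and the field layout assumed is the standard user, nice, system, idle, iowait, irq, softirq, steal order. Because these counters are tick-based, the percentages deserve some caution on a machine whose timekeeping is misbehaving.

```python
# Sketch: compute %idle from two samples of the aggregate "cpu" line in
# /proc/stat (cumulative ticks since boot), similar to what top and sar do.
import time


def read_cpu_ticks(stat_text):
    """Return the tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line in /proc/stat text")


def idle_percent(before, after):
    """Percentage of elapsed ticks spent idle (field 4, index 3)."""
    # Fields: user nice system idle iowait irq softirq steal [guest ...].
    # Only the first 8 are summed because guest time is already in user.
    deltas = [b - a for a, b in zip(before, after)][:8]
    total = sum(deltas)
    return 100.0 * deltas[3] / total if total else 0.0


if __name__ == "__main__":
    with open("/proc/stat") as f:
        first = read_cpu_ticks(f.read())
    time.sleep(5)
    with open("/proc/stat") as f:
        second = read_cpu_ticks(f.read())
    print(f"idle: {idle_percent(first, second):.1f}%")
```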

The machine entered this state at about 4:45pm yesterday afternoon.  It is 
now 12:00 noon the next day, yet the "date" command says the system thinks 
the time is 7:26PM yesterday.  In the last 47 minutes of real time the 
system clock has advanced only 6 minutes, so it is running slow by a factor 
of roughly 7.8 (47 / 6 = 7.83).
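The slowdown factor is just real elapsed time divided by system-clock elapsed time over the same interval, with the real time taken from a trusted reference such as another host's clock. A minimal Python 3 sketch, using the 47-minute and 6-minute figures from this message; slowdown_factor is a hypothetical helper name.

```python
# Sketch: how many seconds of real time pass per second of system-clock time.
from datetime import timedelta


def slowdown_factor(real_elapsed, clock_elapsed):
    """Ratio of real elapsed time to elapsed time on the system clock."""
    clock = clock_elapsed.total_seconds()
    if clock <= 0:
        raise ValueError("system clock did not advance")
    return real_elapsed.total_seconds() / clock


if __name__ == "__main__":
    # Figures from the post: 47 real minutes, 6 minutes on the system clock.
    factor = slowdown_factor(timedelta(minutes=47), timedelta(minutes=6))
    print(f"system clock runs about {factor:.1f}x slow")  # prints 7.8x
```

Repeating the measurement over several intervals would show whether the rate is steady or varies over time.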
Another interesting little symptom: when this slowdown is in effect, 
keyboard autorepeat stops working.

If this were the only machine doing this, I'd think it was a hardware 
problem.  But (a) it isn't the only machine, and (b) although it seems to 
happen to these machines consistently, it happens only after they have run 
for at least a few hours without problems.

        Gary
-- 
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
