Re: Meaning of Cluster Cycle and timeout problems - GFS 100% cpu utilization

Hi,

Thanks for the fast response!

It looks like GFS is causing 100% CPU utilization, leaving the qdiskd process with no processor time.

Is this a known problem and has anyone seen such behavior before?

We are using RHEL 4.5 with the following packages:

ccs-1.0.11-1.x86_64.rpm
cman-1.0.17-0.x86_64.rpm
cman-kernel-2.6.9-53.5.x86_64.rpm
dlm-1.0.7-1.x86_64.rpm
dlm-kernel-2.6.9-52.2.x86_64.rpm
fence-1.32.50-2.x86_64.rpm
GFS-6.1.15-1.x86_64.rpm
GFS-kernel-2.6.9-75.9.x86_64.rpm
gulm-1.0.10-0.x86_64.rpm
iddev-2.0.0-4.x86_64.rpm
lvm2-cluster-2.02.27-2.el4.x86_64.rpm
magma-1.0.8-1.x86_64.rpm
magma-plugins-1.0.12-0.x86_64.rpm
perl-Net-Telnet-3.03-3.noarch.rpm
rgmanager-1.9.72-1.x86_64.rpm
system-config-cluster-1.0.51-2.0.noarch.rpm

The kernel is 2.6.9-55.

Thanks for reading and answering,


Peter

On 17.04.2008 at 20:54, Lon Hohberger wrote:

On Thu, 2008-04-17 at 09:08 +0200, Peter wrote:
Hi!

In our Cluster we have the following entry in the "messages" logfile:

"qdiskd[4314]: <warning> qdisk cycle took more than 3 seconds to
complete (3.890000)"

It means it took more than 3 seconds for one qdiskd cycle to complete.
That is a whole lot of time, considering all one cycle does is:

  8192 bytes in 16 block reads
  some internal calculations
  512  bytes in 1 block write

(that's it...)
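As a sanity check, you can time roughly the same read pattern by hand; a minimal sketch, assuming /dev/sdh is the quorum device. The write step is deliberately left out, since writing to the device by hand would clobber the qdisk data, and iflag=direct needs a newer coreutils than RHEL4 ships (drop it if your dd rejects it, though the timing then includes page-cache effects):

  # time 16 sequential 512-byte reads from the quorum device
  time dd if=/dev/sdh of=/dev/null bs=512 count=16 iflag=direct

On healthy storage this should finish in a few milliseconds, nowhere near 3 seconds.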


These messages are very frequent. I cannot find anything except the
source code via Google, and I am sorry to say that I am not familiar
enough with C to get the point.


We also sometimes get a quorum timeout:

"kernel: CMAN: Quorum device /dev/sdh timed out"


Are these two messages independent, and what is the meaning of the
first message?


No, they're 100% related. It sounds like qdiskd is getting starved for
I/O to /dev/sdh, or possibly it's getting CPU-starved for some reason.
Being that it's more or less a real-time program which helps keep the
cluster running, that's bad!  In your case, it's getting hung up for
longer than the cluster failover time, so CMAN thinks qdiskd has died.
Not good.
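(For the arithmetic: a node is evicted after roughly interval * tko
seconds of missed cycles, so the stock interval=1, tko=10 gives a
10-second window; cycles taking 3.89 seconds each eat a big chunk of it.)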


(1) Turn *off* status_file if you have it enabled! It's for debugging,
and under certain load patterns, it can really slow down qdiskd.
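If it is enabled, the cluster.conf entry looks something like this
(attribute values here are made up for illustration; the fix is deleting
the status_file attribute, not repointing it):

  <quorumd interval="3" tko="5" votes="1" device="/dev/sdh"
           status_file="/tmp/qdisk_status"/>
  <!-- drop the status_file attribute entirely -->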


(2) If you think it's I/O, what you should try is (assuming you're using
cluster2/rhel5/centos5/etc. here):

 echo deadline > /sys/block/sdh/queue/scheduler

If you had the default of 10 seconds (interval 1, tko 10), you should also
do:

 echo 2500 > /sys/block/sdh/queue/iosched/write_expire

... you've got at least 3 for interval, so I'm not sure this would apply
to you.
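To double-check which scheduler is active, read the same sysfs file back;
the one in brackets is in use:

  cat /sys/block/sdh/queue/scheduler
  noop anticipatory [deadline] cfq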

[NOTE: On rhel4/centos4/stable, I think you have to set the I/O
scheduler globally in the kernel command line at system boot.]
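On those releases that means booting with elevator=deadline, e.g. on the
kernel line in /boot/grub/grub.conf (the kernel version and root device
below are just placeholders):

  kernel /vmlinuz-2.6.9-55.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline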


(3) If you think qdiskd is getting CPU starved, you can adjust the
'scheduler' and 'priority' values in cluster.conf to something
different.  I think the man page might be wrong; I think the highest
'priority' value for the 'rr' scheduler is 99, not 100.  See the
qdisk(5) man page for more information on those.
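A minimal sketch of what that might look like in cluster.conf (the
priority value is illustrative; check qdisk(5) for the valid range on
your version):

  <quorumd interval="3" tko="5" votes="1" device="/dev/sdh"
           scheduler="rr" priority="89"/>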

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
