Hi,

Thanks for the fast response! It looks like GFS causes 100% CPU utilization, and therefore the qdiskd process gets no processor time. Is this a known problem, and has anyone seen such behavior before?

We are using RHEL 4.5 with the following packages:

  ccs-1.0.11-1.x86_64.rpm
  cman-1.0.17-0.x86_64.rpm
  cman-kernel-2.6.9-53.5.x86_64.rpm
  dlm-1.0.7-1.x86_64.rpm
  dlm-kernel-2.6.9-52.2.x86_64.rpm
  fence-1.32.50-2.x86_64.rpm
  GFS-6.1.15-1.x86_64.rpm
  GFS-kernel-2.6.9-75.9.x86_64.rpm
  gulm-1.0.10-0.x86_64.rpm
  iddev-2.0.0-4.x86_64.rpm
  lvm2-cluster-2.02.27-2.el4.x86_64.rpm
  magma-1.0.8-1.x86_64.rpm
  magma-plugins-1.0.12-0.x86_64.rpm
  perl-Net-Telnet-3.03-3.noarch.rpm
  rgmanager-1.9.72-1.x86_64.rpm
  system-config-cluster-1.0.51-2.0.noarch.rpm

The kernel is 2.6.9-55.

Thanks for reading and answering,
Peter

On 17.04.2008 at 20:54, Lon Hohberger wrote:
On Thu, 2008-04-17 at 09:08 +0200, Peter wrote:
> Hi! In our cluster we have the following entry in the "messages"
> logfile: "qdiskd[4314]: <warning> qdisk cycle took more than 3
> seconds to complete (3.890000)"

It means it took more than 3 seconds for one qdiskd cycle to complete. This is a whole lot:

  8192 bytes in 16 block reads
  some internal calculations
  512 bytes in 1 block write

(that's it...)

> These messages are very frequent. I cannot find anything except the
> source code via Google, and I am sorry to say I am not familiar
> enough with C to get the point. We also sometimes have a quorum
> timeout: "kernel: CMAN: Quorum device /dev/sdh timed out". Are these
> two messages independent, and what is the meaning of the first one?

No, they're 100% related. It sounds like qdiskd is getting starved for I/O to /dev/sdh, or possibly it's getting CPU-starved for some reason. Given that it's more or less a real-time program which helps keep the cluster running, that's bad! In your case, it's getting hung up for longer than the cluster failover time, so CMAN thinks qdiskd has died. Not good.

(1) Turn *off* status_file if you have it enabled! It's for debugging, and under certain load patterns it can really slow down qdiskd.

(2) If you think it's I/O, what you should try is (assuming you're using cluster2/rhel5/centos5/etc. here):

    echo deadline > /sys/block/sdh/queue/scheduler

If you had a default of 10 seconds (1 interval, 10 tko), you should also do:

    echo 2500 > /sys/block/sdh/queue/iosched/write_expire

... you've got at least 3 for interval, so I'm not sure this would apply to you.

[NOTE: On rhel4/centos4/stable, I think you have to set the I/O scheduler globally on the kernel command line at system boot.]

(3) If you think qdiskd is getting CPU-starved, you can adjust the 'scheduler' and 'priority' values in cluster.conf to something different. I think the man page might be wrong; I believe the highest 'priority' value for the 'rr' scheduler is 99, not 100.
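As a concrete sketch of (3), a quorumd stanza in cluster.conf along these lines would do it; note that the interval/tko/votes/device values below are only placeholders, not taken from any real configuration, so adjust them to your setup:

```xml
<!-- Sketch only: interval, tko, votes, and device are placeholder
     values. scheduler="rr" with priority="99" runs qdiskd at the
     highest round-robin realtime priority. -->
<quorumd interval="3" tko="5" votes="1" device="/dev/sdh"
         scheduler="rr" priority="99"/>
```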
See the qdisk(5) man page for more information on those.

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster