Shawn,

Looking at the output below, you may want to try increasing statfs_slots to 256. Also, if you have any disk monitoring utilities that poll drive usage, you may want to set statfs_fast to 1.
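For example, something along these lines (the /mnt/svn_users path is only a placeholder for whatever your GFS mount points are; settune values are per mount, per node, and do not persist across a remount, so they need to be reapplied after each mount):

    gfs_tool settune /mnt/svn_users statfs_slots 256
    gfs_tool settune /mnt/svn_users statfs_fast 1

statfs_fast trades a little accuracy in the reported usage numbers for much cheaper statfs calls, which helps when monitoring tools poll df-style figures frequently.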
--- Jay

Shawn Hood wrote:

High priority support request, I mean.

On Mon, Oct 13, 2008 at 5:32 PM, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:

As a heads up, I'm about to open a high priority bug on this. It's crippling us. Also, I meant to say it is a 4-node cluster, not a 3-node. Please let me know if I can provide any more information in addition to this. I will provide the information from a time series of gfs_tool counters commands with the support request.

Shawn

On Tue, Oct 7, 2008 at 1:40 PM, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:

More info:

All filesystems are mounted using noatime,nodiratime,noquota. All filesystems report the same data from gfs_tool gettune:

ilimit1 = 100
ilimit1_tries = 3
ilimit1_min = 1
ilimit2 = 500
ilimit2_tries = 10
ilimit2_min = 3
demote_secs = 300
incore_log_blocks = 1024
jindex_refresh_secs = 60
depend_secs = 60
scand_secs = 5
recoverd_secs = 60
logd_secs = 1
quotad_secs = 5
inoded_secs = 15
glock_purge = 0
quota_simul_sync = 64
quota_warn_period = 10
atime_quantum = 3600
quota_quantum = 60
quota_scale = 1.0000 (1, 1)
quota_enforce = 0
quota_account = 0
new_files_jdata = 0
new_files_directio = 0
max_atomic_write = 4194304
max_readahead = 262144
lockdump_size = 131072
stall_secs = 600
complain_secs = 10
reclaim_limit = 5000
entries_per_readdir = 32
prefetch_secs = 10
statfs_slots = 64
max_mhc = 10000
greedy_default = 100
greedy_quantum = 25
greedy_max = 250
rgrp_try_threshold = 100
statfs_fast = 0
seq_readahead = 0

And data on the filesystems from gfs_tool counters:

locks  2948
locks held  1352
freeze count  0
incore inodes  1347
metadata buffers  0
unlinked inodes  0
quota IDs  0
incore log buffers  0
log space used  0.05%
meta header cache entries  0
glock dependencies  0
glocks on reclaim list  0
log wraps  2
outstanding LM calls  0
outstanding BIO calls  0
fh2dentry misses  0
glocks reclaimed  223287
glock nq calls  1812286
glock dq calls  1810926
glock prefetch calls  101158
lm_lock calls  198294
lm_unlock calls  142643
lm callbacks  341621
address operations  502691
dentry operations  395330
export operations  0
file operations  199243
inode operations  984276
super operations  1727082
vm operations  0
block I/O reads  520531
block I/O writes  130315

locks  171423
locks held  85717
freeze count  0
incore inodes  85376
metadata buffers  1474
unlinked inodes  0
quota IDs  0
incore log buffers  24
log space used  0.83%
meta header cache entries  6621
glock dependencies  2037
glocks on reclaim list  0
log wraps  428
outstanding LM calls  0
outstanding BIO calls  0
fh2dentry misses  0
glocks reclaimed  45784677
glock nq calls  962822941
glock dq calls  962595532
glock prefetch calls  20215922
lm_lock calls  40708633
lm_unlock calls  23410498
lm callbacks  64156052
address operations  705464659
dentry operations  19701522
export operations  0
file operations  364990733
inode operations  98910127
super operations  440061034
vm operations  7
block I/O reads  90394984
block I/O writes  131199864

locks  2916542
locks held  1476005
freeze count  0
incore inodes  1454165
metadata buffers  12539
unlinked inodes  100
quota IDs  0
incore log buffers  11
log space used  13.33%
meta header cache entries  9928
glock dependencies  110
glocks on reclaim list  0
log wraps  2393
outstanding LM calls  25
outstanding BIO calls  0
fh2dentry misses  55546
glocks reclaimed  127341056
glock nq calls  867427
glock dq calls  867430
glock prefetch calls  36679316
lm_lock calls  110179878
lm_unlock calls  84588424
lm callbacks  194863553
address operations  250891447
dentry operations  359537343
export operations  390941288
file operations  399156716
inode operations  537830
super operations  1093798409
vm operations  774785
block I/O reads  258044208
block I/O writes  101585172

On Tue, Oct 7, 2008 at 1:33 PM, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:

Problem:
It seems that I/O on one machine in the cluster (not always the same machine) will hang, and all processes accessing the clustered LVs will block. Other machines follow suit shortly thereafter, until the machine that first exhibited the problem is rebooted (via fence_drac, manually). No messages in dmesg, syslog, etc. Filesystems were recently fsck'd.

Hardware:
Dell 1950s (similar except memory -- 3x 16GB RAM, 1x 8GB RAM), running RHEL4 ES U7. Four machines.
Onboard gigabit NICs (machines use little bandwidth, and all network traffic, including DLM, shares the NICs)
QLogic 2462 PCI-Express dual-channel FC HBAs
QLogic SANBox 5200 FC switch
Apple XRAID, which presents as two LUNs (~4.5 TB raw aggregate)
Cisco Catalyst switch

Simple four-machine RHEL4 U7 cluster running kernel 2.6.9-78.0.1.ELsmp x86_64 with the following packages:
ccs-1.0.12-1
cman-1.0.24-1
cman-kernel-smp-2.6.9-55.13.el4_7.1
cman-kernheaders-2.6.9-55.13.el4_7.1
dlm-kernel-smp-2.6.9-54.11.el4_7.1
dlm-kernheaders-2.6.9-54.11.el4_7.1
fence-1.32.63-1.el4_7.1
GFS-6.1.18-1
GFS-kernel-smp-2.6.9-80.9.el4_7.1

One clustered VG, striped across two physical volumes, which correspond to the two sides of the Apple XRAID.

Clustered volume group info:

--- Volume group ---
VG Name               hq-san
System ID
Format                lvm2
Metadata Areas        2
Metadata Sequence No  50
VG Access             read/write
VG Status             resizable
Clustered             yes
Shared                no
MAX LV                0
Cur LV                3
Open LV               3
Max PV                0
Cur PV                2
Act PV                2
VG Size               4.55 TB
PE Size               4.00 MB
Total PE              1192334
Alloc PE / Size       905216 / 3.45 TB
Free PE / Size        287118 / 1.10 TB
VG UUID               hfeIhf-fzEq-clCf-b26M-cMy3-pphm-B6wmLv

Logical volumes contained within the hq-san VG:

cam_development  hq-san  -wi-ao  500.00G
qa               hq-san  -wi-ao    1.07T
svn_users        hq-san  -wi-ao    1.89T

All four machines mount svn_users, two machines mount qa, and one mounts cam_development.

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster alias="tungsten" config_version="31" name="qualia">
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="odin" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="odin-drac"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="hugin" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="hugin-drac"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="munin" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="munin-drac"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="zeus" votes="1">
            <fence>
                <method name="1">
                    <device modulename="" name="zeus-drac"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="0"/>
    <fencedevices>
        <resources/>
        <fencedevice name="odin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
        <fencedevice name="hugin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
        <fencedevice name="munin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
        <fencedevice name="zeus-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
    </fencedevices>
    <rm>
        <failoverdomains/>
        <resources/>
    </rm>
</cluster>

--
Shawn Hood
910.670.1819 m
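For reference, a time series of gfs_tool counters output like the one mentioned above can be gathered with a simple loop; the mount point, interval, and output file below are only placeholders:

    # sample the counters for one GFS mount every 60 seconds, with timestamps
    while true; do
        date
        gfs_tool counters /mnt/svn_users
        sleep 60
    done >> /tmp/gfs_counters.log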
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster