High priority support request, I mean.

On Mon, Oct 13, 2008 at 5:32 PM, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:
> As a heads up, I'm about to open a high priority bug on this. It's
> crippling us. Also, I meant to say it is a 4-node cluster, not a
> 3-node one.
>
> Please let me know if I can provide any more information in addition
> to this. I will provide the information from a time series of
> gfs_tool counters commands with the support request.
>
> Shawn
>
> On Tue, Oct 7, 2008 at 1:40 PM, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:
>> More info:
>>
>> All filesystems are mounted with noatime,nodiratime,noquota.
>>
>> All filesystems report the same data from gfs_tool gettune:
>>
>> ilimit1 = 100
>> ilimit1_tries = 3
>> ilimit1_min = 1
>> ilimit2 = 500
>> ilimit2_tries = 10
>> ilimit2_min = 3
>> demote_secs = 300
>> incore_log_blocks = 1024
>> jindex_refresh_secs = 60
>> depend_secs = 60
>> scand_secs = 5
>> recoverd_secs = 60
>> logd_secs = 1
>> quotad_secs = 5
>> inoded_secs = 15
>> glock_purge = 0
>> quota_simul_sync = 64
>> quota_warn_period = 10
>> atime_quantum = 3600
>> quota_quantum = 60
>> quota_scale = 1.0000 (1, 1)
>> quota_enforce = 0
>> quota_account = 0
>> new_files_jdata = 0
>> new_files_directio = 0
>> max_atomic_write = 4194304
>> max_readahead = 262144
>> lockdump_size = 131072
>> stall_secs = 600
>> complain_secs = 10
>> reclaim_limit = 5000
>> entries_per_readdir = 32
>> prefetch_secs = 10
>> statfs_slots = 64
>> max_mhc = 10000
>> greedy_default = 100
>> greedy_quantum = 25
>> greedy_max = 250
>> rgrp_try_threshold = 100
>> statfs_fast = 0
>> seq_readahead = 0
>>
>> And data on the filesystems from gfs_tool counters:
>>
>> locks 2948
>> locks held 1352
>> freeze count 0
>> incore inodes 1347
>> metadata buffers 0
>> unlinked inodes 0
>> quota IDs 0
>> incore log buffers 0
>> log space used 0.05%
>> meta header cache entries 0
>> glock dependencies 0
>> glocks on reclaim list 0
>> log wraps 2
>> outstanding LM calls 0
>> outstanding BIO calls 0
>> fh2dentry misses 0
>> glocks reclaimed 223287
>> glock nq calls 1812286
>> glock dq calls 1810926
>> glock prefetch calls 101158
>> lm_lock calls 198294
>> lm_unlock calls 142643
>> lm callbacks 341621
>> address operations 502691
>> dentry operations 395330
>> export operations 0
>> file operations 199243
>> inode operations 984276
>> super operations 1727082
>> vm operations 0
>> block I/O reads 520531
>> block I/O writes 130315
>>
>> locks 171423
>> locks held 85717
>> freeze count 0
>> incore inodes 85376
>> metadata buffers 1474
>> unlinked inodes 0
>> quota IDs 0
>> incore log buffers 24
>> log space used 0.83%
>> meta header cache entries 6621
>> glock dependencies 2037
>> glocks on reclaim list 0
>> log wraps 428
>> outstanding LM calls 0
>> outstanding BIO calls 0
>> fh2dentry misses 0
>> glocks reclaimed 45784677
>> glock nq calls 962822941
>> glock dq calls 962595532
>> glock prefetch calls 20215922
>> lm_lock calls 40708633
>> lm_unlock calls 23410498
>> lm callbacks 64156052
>> address operations 705464659
>> dentry operations 19701522
>> export operations 0
>> file operations 364990733
>> inode operations 98910127
>> super operations 440061034
>> vm operations 7
>> block I/O reads 90394984
>> block I/O writes 131199864
>>
>> locks 2916542
>> locks held 1476005
>> freeze count 0
>> incore inodes 1454165
>> metadata buffers 12539
>> unlinked inodes 100
>> quota IDs 0
>> incore log buffers 11
>> log space used 13.33%
>> meta header cache entries 9928
>> glock dependencies 110
>> glocks on reclaim list 0
>> log wraps 2393
>> outstanding LM calls 25
>> outstanding BIO calls 0
>> fh2dentry misses 55546
>> glocks reclaimed 127341056
>> glock nq calls 867427
>> glock dq calls 867430
>> glock prefetch calls 36679316
>> lm_lock calls 110179878
>> lm_unlock calls 84588424
>> lm callbacks 194863553
>> address operations 250891447
>> dentry operations 359537343
>> export operations 390941288
>> file operations 399156716
>> inode operations 537830
>> super operations 1093798409
>> vm operations 774785
>> block I/O reads 258044208
>> block I/O writes 101585172
>>
>> On Tue, Oct 7, 2008 at 1:33 PM, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:
>>> Problem:
>>> It seems that I/O on one machine in the cluster (not always the same
>>> machine) will hang, and all processes accessing the clustered LVs
>>> will block. The other machines follow suit shortly thereafter, until
>>> the machine that first exhibited the problem is rebooted (manually,
>>> via fence_drac). There are no messages in dmesg, syslog, etc. The
>>> filesystems were recently fsck'd.
>>>
>>> Hardware:
>>> Four Dell 1950s (similar except for memory: 3x 16GB RAM, 1x 8GB RAM),
>>> running RHEL4 ES U7
>>> Onboard gigabit NICs (the machines use little bandwidth, and all
>>> network traffic, including DLM, shares the NICs)
>>> QLogic 2462 PCI-Express dual-channel FC HBAs
>>> QLogic SANBox 5200 FC switch
>>> Apple XRAID, which presents as two LUNs (~4.5TB raw aggregate)
>>> Cisco Catalyst switch
>>>
>>> A simple four-machine RHEL4 U7 cluster running kernel
>>> 2.6.9-78.0.1.ELsmp x86_64 with the following packages:
>>> ccs-1.0.12-1
>>> cman-1.0.24-1
>>> cman-kernel-smp-2.6.9-55.13.el4_7.1
>>> cman-kernheaders-2.6.9-55.13.el4_7.1
>>> dlm-kernel-smp-2.6.9-54.11.el4_7.1
>>> dlm-kernheaders-2.6.9-54.11.el4_7.1
>>> fence-1.32.63-1.el4_7.1
>>> GFS-6.1.18-1
>>> GFS-kernel-smp-2.6.9-80.9.el4_7.1
>>>
>>> One clustered VG, striped across two physical volumes that
>>> correspond to the two sides of the Apple XRAID.
>>> Clustered volume group info:
>>>   --- Volume group ---
>>>   VG Name               hq-san
>>>   System ID
>>>   Format                lvm2
>>>   Metadata Areas        2
>>>   Metadata Sequence No  50
>>>   VG Access             read/write
>>>   VG Status             resizable
>>>   Clustered             yes
>>>   Shared                no
>>>   MAX LV                0
>>>   Cur LV                3
>>>   Open LV               3
>>>   Max PV                0
>>>   Cur PV                2
>>>   Act PV                2
>>>   VG Size               4.55 TB
>>>   PE Size               4.00 MB
>>>   Total PE              1192334
>>>   Alloc PE / Size       905216 / 3.45 TB
>>>   Free  PE / Size       287118 / 1.10 TB
>>>   VG UUID               hfeIhf-fzEq-clCf-b26M-cMy3-pphm-B6wmLv
>>>
>>> Logical volumes contained within the hq-san VG:
>>>   cam_development hq-san -wi-ao 500.00G
>>>   qa              hq-san -wi-ao   1.07T
>>>   svn_users       hq-san -wi-ao   1.89T
>>>
>>> All four machines mount svn_users, two machines mount qa, and one
>>> mounts cam_development.
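[A minimal sketch of one way to collect the time series of gfs_tool
counters output promised above: the loop samples each GFS mount once a
minute. The mount point paths and log location are assumptions chosen
for illustration, not taken from the thread.]

    #!/bin/bash
    # Sample gfs_tool counters for each GFS mount once a minute and
    # append the timestamped output to a log for later comparison.
    # NOTE: the mount point paths below are hypothetical.
    MOUNTS="/mnt/svn_users /mnt/qa /mnt/cam_development"
    while true; do
        for m in $MOUNTS; do
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') $m ==="
            gfs_tool counters "$m"
        done >> /var/tmp/gfs_counters.log
        sleep 60
    done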
>>>
>>> /etc/cluster/cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster alias="tungsten" config_version="31" name="qualia">
>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>     <clusternodes>
>>>         <clusternode name="odin" votes="1">
>>>             <fence>
>>>                 <method name="1">
>>>                     <device modulename="" name="odin-drac"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="hugin" votes="1">
>>>             <fence>
>>>                 <method name="1">
>>>                     <device modulename="" name="hugin-drac"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="munin" votes="1">
>>>             <fence>
>>>                 <method name="1">
>>>                     <device modulename="" name="munin-drac"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="zeus" votes="1">
>>>             <fence>
>>>                 <method name="1">
>>>                     <device modulename="" name="zeus-drac"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>     </clusternodes>
>>>     <cman expected_votes="1" two_node="0"/>
>>>     <fencedevices>
>>>         <resources/>
>>>         <fencedevice name="odin-drac" agent="fence_drac"
>>>             ipaddr="redacted" login="root" passwd="redacted"/>
>>>         <fencedevice name="hugin-drac" agent="fence_drac"
>>>             ipaddr="redacted" login="root" passwd="redacted"/>
>>>         <fencedevice name="munin-drac" agent="fence_drac"
>>>             ipaddr="redacted" login="root" passwd="redacted"/>
>>>         <fencedevice name="zeus-drac" agent="fence_drac"
>>>             ipaddr="redacted" login="root" passwd="redacted"/>
>>>     </fencedevices>
>>>     <rm>
>>>         <failoverdomains/>
>>>         <resources/>
>>>     </rm>
>>> </cluster>

--
Shawn Hood
910.670.1819 m
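[One observation on the numbers above: the gettune output shows
glock_purge = 0 and demote_secs = 300, while the counters show tens of
millions of glocks reclaimed. Enabling glock trimming was a commonly
suggested mitigation for glock buildup on RHEL4 GFS. The sketch below
is illustrative only, not a tested recommendation; it reuses the same
hypothetical mount points, and the values 50 and 60 are assumptions.
These tunables are per-mount and reset at every remount.]

    # Ask GFS to purge up to 50% of unused glocks on each scan and to
    # demote idle locks after 60s instead of 300s. Illustrative values;
    # settings do not persist, so reapply after every mount.
    for m in /mnt/svn_users /mnt/qa /mnt/cam_development; do
        gfs_tool settune "$m" glock_purge 50
        gfs_tool settune "$m" demote_secs 60
    done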