Hi Steve,

I think I recently observed a very similar behavior with RHEL5 and
gfs2. It was a gfs2 filesystem holding about 2 million files with a
total size of about 2 GB in a single directory. When I ran "du -shx ."
in that directory it took about 5 minutes (with the noatime mount
option set), independent of how many nodes took part in the cluster
(in the end I tested with only one node). This was only the case for
the first run; all du commands executed later were much faster.

When I mounted the very same filesystem with lockproto=lock_nolock,
the same command took about 10-20 seconds.

Next I started to analyze this with oprofile and observed the
following results:

opreport --long-file-names:
CPU: AMD64 family10, speed 2900.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
200569   46.7639  search_rsb_list
118905   27.7234  create_lkb
 32499    7.5773  search_bucket
  4125    0.9618  find_lkb
  3641    0.8489  process_send_sockets
  3420    0.7974  dlm_scan_rsbs
  3184    0.7424  _request_lock
  3012    0.7023  find_rsb
  2735    0.6377  receive_from_sock
  2610    0.6085  _receive_message
  2543    0.5929  dlm_allocate_rsb
  2299    0.5360  dlm_hash2nodeid
  2228    0.5195  _create_message
  2180    0.5083  dlm_astd
  2163    0.5043  dlm_find_lockspace_global
  2109    0.4917  dlm_find_lockspace_local
  2074    0.4836  dlm_lowcomms_get_buffer
  2060    0.4803  dlm_lock
  1982    0.4621  put_rsb
..

opreport --image /gfs2:
CPU: AMD64 family10, speed 2900.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
  9310   15.5600  search_bucket
  6268   10.4758  do_promote
  2704    4.5192  gfs2_glock_put
  2289    3.8256  gfs2_glock_hold
  2286    3.8206  gfs2_glock_schedule_for_reclaim
  2204    3.6836  gfs2_glock_nq
  2204    3.6836  run_queue
  2001    3.3443  gfs2_holder_wake
..

opreport --image /dlm:
CPU: AMD64 family10, speed 2900.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
200569   46.7639  search_rsb_list
118905   27.7234  create_lkb
 32499    7.5773  search_bucket
  4125    0.9618  find_lkb
  3641    0.8489  process_send_sockets
  3420    0.7974  dlm_scan_rsbs
  3184    0.7424  _request_lock
  3012    0.7023  find_rsb
  2735    0.6377  receive_from_sock
  2610    0.6085  _receive_message
  2543    0.5929  dlm_allocate_rsb
  2299    0.5360  dlm_hash2nodeid
  2228    0.5195  _create_message
..

This very much reminded me of a similar test we did years ago with gfs
(see http://www.open-sharedroot.org/Members/marc/blog/blog-on-dlm/red-hat-dlm-__find_lock_by_id/profile-data-with-diffrent-table-sizes).

Doesn't this show that during the du command the kernel spends about
46% of its time in the dlm function search_rsb_list while looking up
locks? It still looks as if the hash table for the locks in the dlm is
much too small, so that lookups in it are no longer constant time.

It would be really interesting to know how long the backup described
below takes when the gfs2 filesystem is mounted exclusively on one
node without cluster locking. To me it looks like you're facing a
similar problem with gfs2 to the one that was worked around in gfs by
introducing the glock_purge functionality, which leads to much smaller
glock/dlm hash tables and makes backups and the like much faster.

I hope this helps.

Thanks and regards,
Marc.
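P.S.: In case you want to reproduce the comparison, the test sequence
was roughly the following (logical volume and mount point names are
placeholders, and the opcontrol lines assume a vmlinux with debug
symbols is installed - adjust the path for your kernel):

# mount with cluster locking (lock_dlm, as configured in the superblock)
mount -o noatime /dev/vg_test/lv_gfs2 /mnt/gfs2
time du -shx /mnt/gfs2/testdir        # first run: ~5 minutes here
umount /mnt/gfs2

# the very same filesystem with local locking only
# (make sure it is not mounted on any other node!)
mount -o noatime,lockproto=lock_nolock /dev/vg_test/lv_gfs2 /mnt/gfs2
time du -shx /mnt/gfs2/testdir        # ~10-20 seconds here
umount /mnt/gfs2

# profile the lock_dlm case
opcontrol --init
opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux
opcontrol --start
mount -o noatime /dev/vg_test/lv_gfs2 /mnt/gfs2
time du -shx /mnt/gfs2/testdir
opcontrol --stop
opreport --long-file-names

If the dlm hash table size really is the limiting factor, then - if I
remember correctly - the table sizes can be inspected and raised via
configfs (/sys/kernel/config/dlm/cluster/rsbtbl_size, lkbtbl_size and
dirtbl_size) before the lockspace is created, i.e. before the
filesystem is mounted; please double-check that these exist on your
kernel version.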
----- Original Message -----
From: "Steven Whitehouse" <swhiteho@xxxxxxxxxx>
To: "linux clustering" <linux-cluster@xxxxxxxxxx>
Sent: Tuesday, 15 February 2011 19:20:20
Subject: Re: optimising DLM speed?

Hi,

On Tue, 2011-02-15 at 17:59 +0000, Alan Brown wrote:
> After lots of headbanging, I'm slowly realising that limits on GFS2
> lock rates and totem message passing appear to be the main inhibitor
> of cluster performance.
>
> Even on disks which are only mounted on one node (using lock_dlm),
> the ping_pong rate is - quite frankly - appalling, at about 5000
> locks/second, falling off to single digits when 3 nodes are active
> on the same directory.
>
Let me try and explain what is going on here... the posix (fcntl)
locks which you are using do not go through the dlm, or at least not
through the main part of the dlm. The lock requests are sent to either
gfs_controld or dlm_controld, depending upon the version of RHCS,
where the requests are processed in userspace via corosync/openais.

> totem's defaults are pretty low:
>
> (from man openais.conf)
>
> max messages/second = 17
> window_size = 50
> encryption = on
> encryption/decryption threads = 1
> netmtu = 1500
>
> I suspect tuning these would have a marked effect on performance.
>
> gfs_controld and dlm_controld aren't even appearing in the CPU usage
> tables (24 GB, dual 5560 CPUs)
>
Only one of gfs_controld/dlm_controld will have any part in dealing
with the locks that you are concerned with, depending on the version.

> We have 2 GFS clusters, 2 nodes (imap) and 3 nodes (fileserving).
>
> The imap system has around 2.5-3 million small files in the Maildir
> imap tree, whilst the fileserver cluster has ~90 1 TB filesystems of
> 1-4 million files apiece (fileserver total is around 150 million
> files).
>
> When things get busy, or when users get silly and drop 10,000 files
> in a directory, performance across the entire cluster goes downhill
> badly - not just in the affected disk or directory.
>
> Even worse: backups - it takes 20-28 hours to run a zero-file
> incremental backup of a 2.1 million file filesystem (ext4 takes
> about 8 minutes for the same file set!)
>
The issues you've reported here don't sound to me as if they are
related to the rate at which posix locks can be granted. These sound
to me a lot more like issues relating to the I/O pattern on the
filesystem. How is the data spread out across directories and across
nodes? Do you try to keep users local to a single node for the imap
servers? Is the backup just doing a single pass scan over the whole
filesystem?

> All heartbeat/lock traffic is handled across a dedicated Gb switch,
> with each cluster in its own vlan to ensure no external cruft gets
> in to cause problems.
>
> I'm seeing heartbeat/lock lan traffic peak out at about 120 kb/s and
> 4000 pps per node at the moment. Clearly the switch isn't the
> problem - and using hardware accelerated igb devices I'm pretty sure
> the networking's fine too.
>
During the actual workload, or just during the ping_pong test?

Steve.

> SAN side, there are 4 8 Gb Qlogic cards facing the fabric, and right
> now the whole mess talks to a Nexsan ATAbeast (which is slow, but
> seldom gets its command queue maxed out).
>
> Has anyone played much with the totem message timings? If so, what
> results have you had?
>
> As a comparison, the same hardware using ext4 on a standalone system
> can trivially max out multiple 1 Gb/s interfaces while transferring
> 1-2 MB files, and gives lock rates of 1.8-2.5 million locks/second
> even with multiple ping_pong processes running.
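P.P.S.: For completeness, the totem defaults Alan quoted live in the
totem section of openais.conf (or, if I remember correctly, as
attributes of the <totem/> tag in cluster.conf when cman generates the
openais configuration). A sketch of where those knobs sit - the values
are just the quoted defaults and the addresses are placeholders, not a
tuning recommendation:

totem {
        version: 2

        # "encryption = on" with one crypto thread in Alan's list
        secauth: on
        threads: 1

        # flow control and MTU defaults from man openais.conf
        max_messages: 17
        window_size: 50
        netmtu: 1500

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.10.0   # placeholder: heartbeat/lock vlan
                mcastaddr: 239.192.10.1     # placeholder
                mcastport: 5405
        }
}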
--
Marc Grimme
Tel: +49 89 4523538-14
Fax: +49 89 9901766-0
E-Mail: grimme@xxxxxxx

ATIX Informationstechnologie und Consulting AG | Einsteinstrasse 10 |
85716 Unterschleissheim | www.atix.de
Commercial register: Amtsgericht Muenchen, registration number
HRB 168930, VAT ID: DE209485962 | Executive board: Thomas Merz
(chairman), Marc Grimme, Mark Hlawatschek, Jan R. Bergrath | Chairman
of the supervisory board: Dr. Martin Buss

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster