Hi,

On Thu, 2009-09-24 at 11:30 +0100, Gavin Conway wrote:
> Hi All,
>
> We have 6 nodes running GFS2 under CentOS 5.3 all connecting via
> Cisco 2960G switches to an MD3000i with 8 x 146GB SAS 15K drives.
> These nodes run a PHP website pulling their PHP and images files from
> a GFS2 volume being exported by iSCSI from the MD3000i.
>
> Problem we have is that since inception we’ve seen issues whereby the
> HTTPD processes will go into a state of ‘D’, zombied’ and the only way
> we have to recover from that is to restart all the nodes in the
> cluster.
>
> I’ve tuned the demote_secs down from 300 to 20 seconds on the
> assumption that file locking is causing an issue.

That is unlikely to make any meaningful change and in fact it could well
hurt performance, depending on the workload.

> Similarly we’re running with the following GFS values;
>
> <gfs_controld plock_ownership="1" plock_rate_limit="0"/>

Try turning off plock_ownership and see if that fixes the problem.

> Can anyone give me some pointers on what we should be investigating
> for why this is failing? I’ve had our networks team crawl over the
> networking and that all seems fine. The MTU is set correctly on the
> MD3000i and on the individual nodes. I’ve also used the ping_pong tool
> and on a single file on the GFS cluster we can get around 90K locks on
> a file. If I run ping_pong against the same file from two nodes that
> then drops to around 70 locks per second. I don’t think that’s the
> issue though.
>
> If anyone can provide some insight to either what to change, what to
> debug or how to investigate this further it’d be greatly appreciated.

There are two things to look at. One is back traces from processes
(echo t > /proc/sysrq-trigger) and the other is the glock dump from
/sys/kernel/debug/fs/gfs2/glocks. The first tells us what is hanging and
the second (hopefully) why.
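
A rough sketch of how the glock dump could be scanned for waiting
holders, assuming the dump uses "G:" lines for glocks followed by "H:"
lines for their holders, each carrying an f: flags field and a
"p:PID [command]" entry. The sample dump, the function traces in it and
the names parse_fields and waiting_holders are made up for illustration;
this is not a tool from the GFS2 utilities.

```python
import re

# Matches "p:4470 [httpd]" on a holder line: PID, then the command name in brackets.
PID_COMM_RE = re.compile(r"\bp:(\d+)\s+\[([^\]]*)\]")
# A field token is a single letter, a colon, then the value (e.g. "f:AW").
FIELD_RE = re.compile(r"^[A-Za-z]:")

# Hypothetical sample in the assumed layout; real dumps are far longer.
SAMPLE_DUMP = """\
G:  s:EX n:2/75320 f:lq t:EX d:EX/0 a:0 r:4
 H: s:EX f:H e:0 p:4466 [httpd] gfs2_inode_lookup+0x14e/0x260 [gfs2]
 H: s:SH f:AW e:0 p:4470 [httpd] gfs2_permission+0x7f/0xd0 [gfs2]
G:  s:SH n:5/75320 f:I t:SH d:EX/0 a:0 r:3
 H: s:SH f:EH e:0 p:4466 [httpd] gfs2_inode_lookup+0x14e/0x260 [gfs2]
"""


def parse_fields(text):
    """Collect single-letter 'key:value' tokens such as s:EX, n:2/75320, f:AW."""
    return dict(tok.split(":", 1) for tok in text.split() if FIELD_RE.match(tok))


def waiting_holders(dump_text):
    """Return (glock number, pid, command) for every holder whose flags include 'W'."""
    results = []
    glock = None  # n: field of the most recent G: line
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("G:"):
            glock = parse_fields(stripped[2:]).get("n")
        elif stripped.startswith("H:") and glock is not None:
            flags = parse_fields(stripped[2:]).get("f", "")
            match = PID_COMM_RE.search(stripped)
            if "W" in flags and match:
                results.append((glock, int(match.group(1)), match.group(2)))
    return results


if __name__ == "__main__":
    for glock, pid, comm in waiting_holders(SAMPLE_DUMP):
        print(glock, pid, comm)
```

The PIDs it reports can then be matched by hand against the stuck
processes in the sysrq-t back traces. On a live node the text would come
from reading the glocks file under debugfs instead of SAMPLE_DUMP.
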
Look for glocks with 'W' in the flags field (f:) for their holders (H:)
and it should be possible to correlate them with the processes which are
stuck. Do you get any messages in the syslog?

Steve.

> Thanks
> Gavin
>
> Gavin Conway
> Senior Engineer, Operations (Systems Group), UKSolutions

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster