I'm trying to use GFS2 in a mail server scenario with:
- CentOS 5.3 updated
- Dovecot IMAP/Maildir
- Postfix
To make the servers active/active I'm using CTDB (http://ctdb.samba.org).
Some info that could be relevant:
[root@pinky ~]# uname -a
Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
kernel-2.6.18-128.1.16.el5
gfs2-utils-0.1.53-1.el5_3.3
modcluster-0.12.1-2.el5.centos
cluster-cim-0.12.1-2.el5.centos
kernel-devel-2.6.18-128.1.10.el5
openais-0.80.3-22.el5_3.8
system-config-cluster-1.0.55-1.0
kernel-2.6.18-128.1.6.el5
kernel-2.6.18-128.1.10.el5
kernel-devel-2.6.18-128.1.16.el5
lvm2-cluster-2.02.40-7.el5
cluster-snmp-0.12.1-2.el5.centos
kernel-headers-2.6.18-128.1.16.el5
kernel-devel-2.6.18-128.1.6.el5
cman-2.0.98-1.el5_3.4
[root@pinky ~]# grep /home /etc/fstab
/dev/homeClusterVG/home_vmail /home gfs2 auto,noatime,quota=off,noexec,nodev,_netdev 0 0
Everything works fine for some time, but two or three times a day some dovecot/deliver processes hang in D state, and the only way I have found to clear them is to reboot the node.
I'm not a developer and don't know much about debugging. Because of other problems I had a while ago, I learned to use "sysrq-t", and here is the output for two of these processes:
Pastebin: http://pastebin.ca/1483264
Jul 3 15:45:20 cerebro kernel: deliver D ffff81007e442800 0 24420 23846 (NOTLB)
Jul 3 15:45:20 cerebro kernel: ffff810013885e08 0000000000000082 ffff810013885d68 0000000000000092
Jul 3 15:45:20 cerebro kernel: ffff810013885e20 0000000000000001 ffff8100141870c0 ffff81000904b0c0
Jul 3 15:45:20 cerebro kernel: 0000052a72ff2a70 000000000000034a ffff8100141872a8 000000036caf5000
Jul 3 15:45:20 cerebro kernel: Call Trace:
Jul 3 15:45:20 cerebro kernel: [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:20 cerebro kernel: [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Jul 3 15:45:20 cerebro kernel: [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:20 cerebro kernel: [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
Jul 3 15:45:20 cerebro kernel: [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:20 cerebro kernel: [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul 3 15:45:20 cerebro kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Jul 3 15:45:21 cerebro kernel: deliver D ffff81000238f480 0 1358 32225 (NOTLB)
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe08 0000000000000082 ffff8100086cfd68 0000000000000092
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe20 0000000000000001 ffff81000904b0c0 ffff81007ff28100
Jul 3 15:45:21 cerebro kernel: 0000052a72ff2ca2 0000000000000232 ffff81000904b2a8 000000037ed68a00
Jul 3 15:45:21 cerebro kernel: Call Trace:
Jul 3 15:45:21 cerebro kernel: [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:21 cerebro kernel: [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Jul 3 15:45:21 cerebro kernel: [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:21 cerebro kernel: [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
Jul 3 15:45:21 cerebro kernel: [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:21 cerebro kernel: [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul 3 15:45:21 cerebro kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Before rebooting the node I went into this user's directory and ran "ls" a few times, and everything worked as expected. I was pretty sure the command would hang, but it didn't.
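Both traces above are blocked under sys_fcntl -> fcntl_setlk -> gfs2_lock -> dlm_posix_lock, which is a POSIX lock request, and "ls" never takes such a lock, so it may not be a representative test. A closer check might be to request a lock on a file in the same directory. What follows is only a minimal sketch, assuming a Python 3 interpreter is available on the node; the name probe_posix_lock and the 10-second timeout are made up for illustration. The lock attempt runs in a child process because it may itself get stuck in D state, and the parent deliberately does not wait for it after the timeout.

```python
import subprocess
import sys

# Script run in the child: try a non-blocking exclusive POSIX lock.
# Exit codes (hypothetical convention): 0 = lock granted, 2 = lock held elsewhere.
_CHILD = r"""
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o600)
try:
    # lockf() goes through fcntl(F_SETLK), the syscall seen in the traces.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(2)
fcntl.lockf(fd, fcntl.LOCK_UN)
sys.exit(0)
"""


def probe_posix_lock(path, timeout=10.0):
    """Return 'ok', 'busy', 'hung' or 'error' for a lock attempt on path."""
    proc = subprocess.Popen([sys.executable, "-c", _CHILD, path])
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child did not come back; do not wait for it any longer,
        # since a task in D state cannot be killed.
        return "hung"
    if code == 0:
        return "ok"
    if code == 2:
        return "busy"
    return "error"


if __name__ == "__main__":
    print(probe_posix_lock(sys.argv[1]))
```

Run against a scratch file inside the affected user's directory, a result of "hung" would suggest that lock requests, rather than ordinary file access, are what stops responding. Note that the probe creates the file if it does not exist.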
Here is the "ps ax" output:
cicero 24420 0.0 0.0 8960 1220 ? Ds 14:46 0:00 /usr/libexec/dovecot/deliver -f cicero -d cicero
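To catch these processes before a reboot is needed, the D-state check can also be scripted instead of reading "ps ax" by hand. This is a minimal sketch, assuming Python 3 and a /proc that exposes the per-process "stat" and "wchan" files; the function names are hypothetical. It prints the PID, command name and kernel wait channel of every task currently in state D.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, command, wchan) for every task in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as handle:
                wchan = handle.read().strip()
        except OSError:
            wchan = "?"  # wait channel not readable on this system
        found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(pid, comm, wchan)
```

Run from cron every minute or so, the output could be appended to a log so that the time of the first hang is recorded along with the process involved.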
I've already rebooted that node, but if there is some way to debug this case more deeply, just let me know; I will probably hit the same situation again before the end of the day.
Thanks in advance.
--
Flávio do Carmo Júnior aka waKKu
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster