Sounds like you are running into the same bug that I ran into with GFS2
on a similar setup nearly 2 years ago, except I could produce a lock-up
in under 2 seconds every time. The solution is to use GFS1 if you really
want to stick with that setup, but bear in mind that, regardless of the
cluster file system (GFS1, GFS2, OCFS2), the performance will scale
_inversely_ with the number of nodes. Cluster file systems really don't
work well with millions of small files.
You might, instead, want to look into something like DBMail with a MySQL
proxy to serialize all writes to a single node.
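As a very rough sketch of what I mean (host names, credentials and the
exact config keys are only examples here -- check the DBMail and
mysql-proxy documentation for your versions): both mail nodes talk to a
local mysql-proxy listener, and the proxy forwards every connection to
a single MySQL master, so all writes end up serialized on one node.

  # /etc/dbmail/dbmail.conf on both mail nodes -- point DBMail at the local proxy
  [DBMAIL]
  driver = mysql
  host   = 127.0.0.1        # the local mysql-proxy, not the DB server itself
  user   = dbmail
  pass   = secret
  db     = dbmail

  # mysql-proxy on each mail node, forwarding to the single write master
  mysql-proxy --proxy-address=127.0.0.1:3306 \
              --proxy-backend-addresses=dbmaster.example.com:3306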
You can, of course, still use GFS1 for the root file system to share the
OS install. Look at the Open Shared Root project if this is of interest.
Gordan
Flavio Junior wrote:
Hi folks....
I'm (trying to) use GFS2 in a mail server scenario, using:
- CentOS 5.3 updated
- Dovecot IMAP/Maildir
- Postfix
To make the servers active/active I'm using CTDB (http://ctdb.samba.org).
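(For anyone unfamiliar with CTDB, a minimal two-node setup is just two
small files, identical on both nodes; the IPs and interface name below
are made-up examples, not my real ones.)

  # /etc/ctdb/nodes -- one private cluster IP per node, same order on every node
  10.0.0.1
  10.0.0.2

  # /etc/ctdb/public_addresses -- floating service IPs that CTDB moves on failover
  192.168.1.10/24 eth0
  192.168.1.11/24 eth0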
Some info that could be relevant:
[root@pinky ~]# uname -a
Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
kernel-2.6.18-128.1.16.el5
gfs2-utils-0.1.53-1.el5_3.3
modcluster-0.12.1-2.el5.centos
cluster-cim-0.12.1-2.el5.centos
kernel-devel-2.6.18-128.1.10.el5
openais-0.80.3-22.el5_3.8
system-config-cluster-1.0.55-1.0
kernel-2.6.18-128.1.6.el5
kernel-2.6.18-128.1.10.el5
kernel-devel-2.6.18-128.1.16.el5
lvm2-cluster-2.02.40-7.el5
cluster-snmp-0.12.1-2.el5.centos
kernel-headers-2.6.18-128.1.16.el5
kernel-devel-2.6.18-128.1.6.el5
cman-2.0.98-1.el5_3.4
[root@pinky ~]# grep /home /etc/fstab
/dev/homeClusterVG/home_vmail /home gfs2 auto,noatime,quota=off,noexec,nodev,_netdev 0 0
Everything works fine for some time, but two or three times a day some
dovecot/deliver process gets hung in D state, and the only way to
recover is to reboot the node.
I'm not a developer and don't know much about debugging. Having hit
other problems before, I learned to use "sysrq-t"; here is the output
related to two of these processes:
Pastebin: http://pastebin.ca/1483264
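(For reference, the dump below was produced with the magic sysrq key,
roughly like this, assuming sysrq is enabled, e.g. kernel.sysrq = 1 in
/etc/sysctl.conf; the trace then shows up in dmesg/syslog:

  echo t > /proc/sysrq-trigger   # dump the state and stack of every task to the kernel log
)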
Jul 3 15:45:20 cerebro kernel: deliver D ffff81007e442800 0 24420 23846 (NOTLB)
Jul 3 15:45:20 cerebro kernel: ffff810013885e08 0000000000000082 ffff810013885d68 0000000000000092
Jul 3 15:45:20 cerebro kernel: ffff810013885e20 0000000000000001 ffff8100141870c0 ffff81000904b0c0
Jul 3 15:45:20 cerebro kernel: 0000052a72ff2a70 000000000000034a ffff8100141872a8 000000036caf5000
Jul 3 15:45:20 cerebro kernel: Call Trace:
Jul 3 15:45:20 cerebro kernel: [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:20 cerebro kernel: [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Jul 3 15:45:20 cerebro kernel: [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:20 cerebro kernel: [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
Jul 3 15:45:20 cerebro kernel: [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:20 cerebro kernel: [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul 3 15:45:20 cerebro kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Jul 3 15:45:21 cerebro kernel: deliver D ffff81000238f480 0 1358 32225 (NOTLB)
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe08 0000000000000082 ffff8100086cfd68 0000000000000092
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe20 0000000000000001 ffff81000904b0c0 ffff81007ff28100
Jul 3 15:45:21 cerebro kernel: 0000052a72ff2ca2 0000000000000232 ffff81000904b2a8 000000037ed68a00
Jul 3 15:45:21 cerebro kernel: Call Trace:
Jul 3 15:45:21 cerebro kernel: [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:21 cerebro kernel: [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Jul 3 15:45:21 cerebro kernel: [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:21 cerebro kernel: [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
Jul 3 15:45:21 cerebro kernel: [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:21 cerebro kernel: [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul 3 15:45:21 cerebro kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Before rebooting the node I went into that user's directory and ran a
few "ls" commands, and everything worked as expected. I was pretty sure
the command would hang, but it didn't.
Here is the "ps ax" output:
cicero 24420 0.0 0.0 8960 1220 ? Ds 14:46 0:00 /usr/libexec/dovecot/deliver -f cicero -d cicero
I've already rebooted that node, but if there is some deeper way to
debug this case, just let me know; I'll probably hit the same situation
again before the end of the day.
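(The only extra data I know how to gather without rebooting is roughly
the following -- commands from memory, so the exact columns may need
adjusting:

  # show which kernel function each D-state process is sleeping in
  ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'

  # the same for a single process, e.g. the hung deliver PID above
  cat /proc/24420/wchan; echo
)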
Thanks in advance.
--
Flávio do Carmo Júnior aka waKKu
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster