Sounds like you are running into the same bug that I ran into with GFS2
on a similar setup nearly 2 years ago, except I could produce a lock-up
in under 2 seconds every time. The solution is to use GFS1 if you really
want to stick with that setup, but bear in mind that, regardless of the
cluster file system (GFS1, GFS2, OCFS2), the performance will scale
_inversely_ with the number of nodes. Cluster file systems really don't
work well with millions of small files.
You might, instead, want to look into something like DBMail with a MySQL
proxy to serialize all writes to a single node.
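As a very rough sketch of what I mean (host names, credentials and the
exact config keys are only examples here -- check the DBMail and
mysql-proxy documentation for your versions): both mail nodes talk to a
local mysql-proxy listener, and the proxy forwards every connection to
a single MySQL master, so all writes end up serialized on one node.

  # /etc/dbmail/dbmail.conf on both mail nodes -- point DBMail at the local proxy
  [DBMAIL]
  driver = mysql
  host   = 127.0.0.1        # the local mysql-proxy, not the DB server itself
  user   = dbmail
  pass   = secret
  db     = dbmail

  # mysql-proxy on each mail node, forwarding to the single write master
  mysql-proxy --proxy-address=127.0.0.1:3306 \
              --proxy-backend-addresses=dbmaster.example.com:3306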
You can, of course, still use GFS1 for the root file system to share the
OS install. Look at the Open Shared Root project if this is of interest.
Gordan
Flavio Junior wrote:
Hi folks....
I'm (trying to) use GFS2 in a mail server scenario, using:
- CentOS 5.3 updated
- Dovecot IMAP/Maildir
- Postfix
To make the servers active/active I'm using CTDB (http://ctdb.samba.org).
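(For anyone unfamiliar with CTDB, a minimal two-node setup is just two
small files, identical on both nodes; the IPs and interface name below
are made-up examples, not my real ones.)

  # /etc/ctdb/nodes -- one private cluster IP per node, same order on every node
  10.0.0.1
  10.0.0.2

  # /etc/ctdb/public_addresses -- floating service IPs that CTDB moves on failover
  192.168.1.10/24 eth0
  192.168.1.11/24 eth0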
Some info that could be relevant:
[root@pinky ~]# uname -a
Linux pinky 2.6.18-128.1.16.el5 #1 SMP Tue Jun 30 06:07:26 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@pinky ~]# rpm -qa | grep -E 'gfs2|clust|kernel|cman|openais'
kernel-2.6.18-128.1.16.el5
gfs2-utils-0.1.53-1.el5_3.3
modcluster-0.12.1-2.el5.centos
cluster-cim-0.12.1-2.el5.centos
kernel-devel-2.6.18-128.1.10.el5
openais-0.80.3-22.el5_3.8
system-config-cluster-1.0.55-1.0
kernel-2.6.18-128.1.6.el5
kernel-2.6.18-128.1.10.el5
kernel-devel-2.6.18-128.1.16.el5
lvm2-cluster-2.02.40-7.el5
cluster-snmp-0.12.1-2.el5.centos
kernel-headers-2.6.18-128.1.16.el5
kernel-devel-2.6.18-128.1.6.el5
cman-2.0.98-1.el5_3.4
[root@pinky ~]# grep /home /etc/fstab
/dev/homeClusterVG/home_vmail /home gfs2 auto,noatime,quota=off,noexec,nodev,_netdev 0 0
Everything works fine for some time, but two or three times a day some
dovecot/deliver process gets hung in D state, and the only way to
recover is to reboot the node.
I'm not a developer and don't know much about debugging. Having hit
other problems before, I learned to use "sysrq-t"; here is the output
related to two of these processes:
Pastebin: http://pastebin.ca/1483264
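(For reference, the dump below was produced with the magic sysrq key,
roughly like this, assuming sysrq is enabled, e.g. kernel.sysrq = 1 in
/etc/sysctl.conf; the trace then shows up in dmesg/syslog:

  echo t > /proc/sysrq-trigger   # dump the state and stack of every task to the kernel log
)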
Jul 3 15:45:20 cerebro kernel: deliver D ffff81007e442800 0 24420 23846 (NOTLB)
Jul 3 15:45:20 cerebro kernel: ffff810013885e08 0000000000000082 ffff810013885d68 0000000000000092
Jul 3 15:45:20 cerebro kernel: ffff810013885e20 0000000000000001 ffff8100141870c0 ffff81000904b0c0
Jul 3 15:45:20 cerebro kernel: 0000052a72ff2a70 000000000000034a ffff8100141872a8 000000036caf5000
Jul 3 15:45:20 cerebro kernel: Call Trace:
Jul 3 15:45:20 cerebro kernel: [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:20 cerebro kernel: [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Jul 3 15:45:20 cerebro kernel: [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:20 cerebro kernel: [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
Jul 3 15:45:20 cerebro kernel: [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:20 cerebro kernel: [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul 3 15:45:20 cerebro kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Jul 3 15:45:21 cerebro kernel: deliver D ffff81000238f480 0 1358 32225 (NOTLB)
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe08 0000000000000082 ffff8100086cfd68 0000000000000092
Jul 3 15:45:21 cerebro kernel: ffff8100086cfe20 0000000000000001 ffff81000904b0c0 ffff81007ff28100
Jul 3 15:45:21 cerebro kernel: 0000052a72ff2ca2 0000000000000232 ffff81000904b2a8 000000037ed68a00
Jul 3 15:45:21 cerebro kernel: Call Trace:
Jul 3 15:45:21 cerebro kernel: [<ffffffff88562a7d>] :dlm:dlm_posix_lock+0x172/0x210
Jul 3 15:45:21 cerebro kernel: [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Jul 3 15:45:21 cerebro kernel: [<ffffffff88591c7a>] :gfs2:gfs2_lock+0xc3/0xcf
Jul 3 15:45:21 cerebro kernel: [<ffffffff8003a39e>] fcntl_setlk+0x11e/0x273
Jul 3 15:45:21 cerebro kernel: [<ffffffff800b5659>] audit_syscall_entry+0x16e/0x1a1
Jul 3 15:45:21 cerebro kernel: [<ffffffff8002ea66>] sys_fcntl+0x269/0x2dc
Jul 3 15:45:21 cerebro kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Before rebooting the node I went into that user's directory and ran a
few "ls" commands, and everything worked as expected. I was pretty sure
the command would hang, but it didn't.
Here is the "ps ax" output:
cicero 24420 0.0 0.0 8960 1220 ? Ds 14:46 0:00 /usr/libexec/dovecot/deliver -f cicero -d cicero
I've already rebooted that node, but if there is some deeper way to
debug this case, just let me know; I'll probably hit the same situation
again before the end of the day.
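(The only extra data I know how to gather without rebooting is roughly
the following -- commands from memory, so the exact columns may need
adjusting:

  # show which kernel function each D-state process is sleeping in
  ps axo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'

  # the same for a single process, e.g. the hung deliver PID above
  cat /proc/24420/wchan; echo
)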
Thanks in advance.
--
Flávio do Carmo Júnior aka waKKu
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster