Hello everyone!
I've set up a cluster in order to use GFS2, and the cluster itself works really well ;)
I then exported the GFS2 filesystem via NFS to share it with machines
outside the cluster. Reads work fine, but as soon as I try to write
to it, the filesystem seems to hang:
root@file03:~# mount filepro01:/mnt/gfs /mnt/tmp -o soft
root@file03:~# ls /mnt/tmp/
algo caca caca2 testa
root@file03:~# mkdir /mnt/tmp/otracosa
At this point NFS stops working. On the NFS client I see (the mount is
soft, so the hang surfaces as a timeout rather than an indefinite retry):
[11132241.127470] nfs: server filepro01 not responding, timed out
However, the directory was indeed created, and the other node can
continue using the GFS2 filesystem locally.
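For reference, the export on filepro01 is along these lines (treat the
client spec and fsid value as illustrative; as I understand it, fsid= is
required when exporting GFS2 over NFS because a clustered device has no
stable device number across nodes):

# /etc/exports on filepro01 -- fsid= gives NFS a stable filesystem id;
# the value just needs to be unique on this server (13 is illustrative)
/mnt/gfs    *(rw,sync,fsid=13)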
On the NFS server (filepro01), the logs show some nasty things. This
first part is from mounting the filesystem, which looks OK:
[6234925.738508] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[6234925.787305] NFSD: starting 90-second grace period
[6234925.825811] GFS2 (built Feb 7 2011 16:11:33) installed
[6234925.826698] GFS2: fsid=: Trying to join cluster "lock_dlm", "wtn_cluster:file01"
[6234925.886991] GFS2: fsid=wtn_cluster:file01.0: Joined cluster. Now mounting FS...
[6234925.975113] GFS2: fsid=wtn_cluster:file01.0: jid=0, already locked for use
[6234925.975116] GFS2: fsid=wtn_cluster:file01.0: jid=0: Looking at journal...
[6234926.075105] GFS2: fsid=wtn_cluster:file01.0: jid=0: Acquiring the transaction lock...
[6234926.075152] GFS2: fsid=wtn_cluster:file01.0: jid=0: Replaying journal...
[6234926.076200] GFS2: fsid=wtn_cluster:file01.0: jid=0: Replayed 8 of 9 blocks
[6234926.076204] GFS2: fsid=wtn_cluster:file01.0: jid=0: Found 1 revoke tags
[6234926.076649] GFS2: fsid=wtn_cluster:file01.0: jid=0: Journal replayed in 1s
[6234926.076800] GFS2: fsid=wtn_cluster:file01.0: jid=0: Done
[6234926.076945] GFS2: fsid=wtn_cluster:file01.0: jid=1: Trying to acquire journal lock...
[6234926.078723] GFS2: fsid=wtn_cluster:file01.0: jid=1: Looking at journal...
[6234926.257645] GFS2: fsid=wtn_cluster:file01.0: jid=1: Done
[6234926.258187] GFS2: fsid=wtn_cluster:file01.0: jid=2: Trying to acquire journal lock...
[6234926.260966] GFS2: fsid=wtn_cluster:file01.0: jid=2: Looking at journal...
[6234926.549636] GFS2: fsid=wtn_cluster:file01.0: jid=2: Done
[6234930.789787] ipmi message handler version 39.2
And when we try to write from the NFS client, bang:
[6235083.656954] BUG: unable to handle kernel NULL pointer dereference at 00000024
[6235083.656973] IP: [<ee2d6c1e>] gfs2_drevalidate+0xe/0x200 [gfs2]
[6235083.656992] *pdpt = 0000000001831027 *pde = 0000000000000000
[6235083.657003] Oops: 0000 [#1] SMP
[6235083.657012] last sysfs file: /sys/module/dlm/initstate
[6235083.657018] Modules linked in: ipmi_msghandler xenfs gfs2 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dlm configfs nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc drbd lru_cache lp parport [last unloaded: scsi_transport_iscsi]
[6235083.657090]
[6235083.657095] Pid: 1497, comm: nfsd Tainted: G W 2.6.38-2-virtual #29~lucid1-Ubuntu /
[6235083.657103] EIP: 0061:[<ee2d6c1e>] EFLAGS: 00010282 CPU: 0
[6235083.657115] EIP is at gfs2_drevalidate+0xe/0x200 [gfs2]
[6235083.657120] EAX: eb9d7180 EBX: eb9d7180 ECX: ee2ec000 EDX: 00000000
[6235083.657127] ESI: eb924580 EDI: 00000000 EBP: c1dc5c68 ESP: c1dc5c20
[6235083.657133] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
[6235083.657139] Process nfsd (pid: 1497, ti=c1dc4000 task=c1b18ca0 task.ti=c1dc4000)
[6235083.657145] Stack:
[6235083.657150]  c1dc5c28 c0627afd c1dc5c68 c0242314 00000000 c1dc5c7c ee2dba0c ee2c02d0
[6235083.657170]  00000001 eb924580 c1a47038 c1dc5cb0 eb9d7188 00000004 14a2fc97 eb9d7180
[6235083.657190]  eb924580 00000000 c1dc5c7c c023a18f eb9d7180 eb924580 eb925000 c1dc5ca0
[6235083.657210] Call Trace:
[6235083.657220] [<c0627afd>] ? _raw_spin_lock+0xd/0x10
[6235083.657230] [<c0242314>] ? __d_lookup+0xf4/0x150
[6235083.657242] [<ee2dba0c>] ? gfs2_permission+0xcc/0x120 [gfs2]
[6235083.657253] [<ee2c02d0>] ? gfs2_check_acl+0x0/0x80 [gfs2]
[6235083.657263] [<c023a18f>] d_revalidate+0x1f/0x60
[6235083.657271] [<c023a2e2>] __lookup_hash+0xa2/0x180
[6235083.657284] [<edd8e266>] ? encode_post_op_attr+0x86/0x90 [nfsd]
[6235083.657292] [<c023a4c3>] lookup_one_len+0x43/0x80
[6235083.657303] [<edd8d13f>] compose_entry_fh+0x9f/0xe0 [nfsd]
[6235083.657315] [<edd8e491>] encode_entryplus_baggage+0x51/0xb0 [nfsd]
[6235083.657327] [<edd8e795>] encode_entry+0x2a5/0x2f0 [nfsd]
[6235083.657338] [<edd8e820>] nfs3svc_encode_entry_plus+0x40/0x50 [nfsd]
[6235083.657349] [<edd8366d>] nfsd_buffered_readdir+0xfd/0x1a0 [nfsd]
[6235083.657361] [<edd8e7e0>] ? nfs3svc_encode_entry_plus+0x0/0x50 [nfsd]
[6235083.657372] [<edd852a0>] nfsd_readdir+0x70/0xb0 [nfsd]
[6235083.657383] [<edd8bd58>] nfsd3_proc_readdirplus+0xd8/0x200 [nfsd]
[6235083.657394] [<edd8e7e0>] ? nfs3svc_encode_entry_plus+0x0/0x50 [nfsd]
[6235083.657405] [<edd7f3a3>] nfsd_dispatch+0xd3/0x210 [nfsd]
[6235083.657423] [<edd0fd83>] svc_process_common+0x2e3/0x590 [sunrpc]
[6235083.657438] [<edd1c86d>] ? svc_xprt_received+0x2d/0x40 [sunrpc]
[6235083.657452] [<edd1cd0b>] ? svc_recv+0x48b/0x750 [sunrpc]
[6235083.657465] [<edd1010c>] svc_process+0xdc/0x140 [sunrpc]
[6235083.657474] [<c0627010>] ? down_read+0x10/0x20
[6235083.657483] [<edd7fa54>] nfsd+0xb4/0x140 [nfsd]
[6235083.657493] [<c0143b9e>] ? complete+0x4e/0x60
[6235083.657503] [<edd7f9a0>] ? nfsd+0x0/0x140 [nfsd]
[6235083.657513] [<c0173354>] kthread+0x74/0x80
[6235083.657520] [<c01732e0>] ? kthread+0x0/0x80
[6235083.657528] [<c010af3e>] kernel_thread_helper+0x6/0x10
[6235083.657533] Code: 8b 53 08 e8 75 d4 0a d2 f7 d0 89 03 31 c0 5b 5d c3 8d b6 00 00 00 00 8d bf 00 00 00 00 55 89 e5 57 56 53 83 ec 3c 3e 8d 74 26 00 <f6> 42 24 40 89 c3 b8 f6 ff ff ff 74 0d 83 c4 3c 5b 5e 5f 5d c3
[6235083.657652] EIP: [<ee2d6c1e>] gfs2_drevalidate+0xe/0x200 [gfs2] SS:ESP 0069:c1dc5c20
[6235083.865070] CR2: 0000000000000024
[6235083.865077] ---[ end trace 2dfc9195648a185b ]---
[6235099.205542] dlm: connecting to 2
Is this a bug?
Is it a known one?
Are there any workarounds?
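The only possibly related thing I've found is the GFS2 documentation's
guidance for exporting over NFS: mount the filesystem with localflocks,
so POSIX locks and flocks are handled locally on the exporting node
instead of through the cluster lock manager. I don't know whether that
can explain the oops, but for reference, this is what I understand the
recommended mount to look like (the device path is illustrative):

# GFS2-over-NFS guidance: mount with localflocks on the exporting node
mount -t gfs2 -o localflocks /dev/clustervg/gfslv /mnt/gfs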
The GFS2+NFS server is a Xen guest running Ubuntu 10.04 with kernel
2.6.38-2-virtual:
# gfs2_tool version
gfs2_tool 3.0.12 (built Jul 5 2011 16:52:20)
Copyright (C) Red Hat, Inc. 2004-2010 All rights reserved.
# cman_tool version
6.2.0 config 2011070805
Here's also the cluster.conf file, just in case ;)
<?xml version="1.0"?>
<cluster name="wtn_cluster" config_version="2011070805">
  <quorumd interval="5" tko="6" label="filepro-qdisk" votes="1"/>
  <cman expected_votes="3" two_node="0"/>
  <totem consensus="72000" token="60000"/>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="filepro01" votes="1" nodeid="1">
      <fence>
        <method name="xen">
          <device name="xen" nodename="filepro01" U="abcdefghijk" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="filepro02" votes="1" nodeid="2">
      <fence>
        <method name="xen">
          <device name="xen" nodename="filepro02" U="qwertyuiop" action="reboot"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="xen" agent="fence_xen"/>
  </fencedevices>
</cluster>
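On the votes, in case it matters: each node contributes 1 vote and the
qdisk adds 1, hence expected_votes="3" and a quorum of 2, so one node
plus the quorum disk stays quorate if the other node is fenced. This can
be checked on a node with something like (grep pattern from memory):

# quorum arithmetic: 2 node votes + 1 qdisk vote = 3 expected, quorum = 2
cman_tool status | grep -i -e votes -e quorum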
Thanks in advance :)
--
Javi Polo
Systems Administrator
Tel 93 734 97 70
Fax 93 734 97 71
jpolo@xxxxxxxxxxxxx