Hello,

I am having trouble sharing out a GFS filesystem via NFS. I have a two-node cluster (active/passive) that is intended to provide NFS shares to a number of clients. It appears that one node crashes, or both nodes hang, when under heavy load (sustained reads and writes by one or more NFS clients for any length of time). The cluster appears to work fine for I/O directly on the nodes - for example, I can run bonnie++ for several days on the nodes directly without problems - but running bonnie++ on the GFS filesystem over NFS causes a crash or a hang within an hour or so.

The crashes result in a kernel "Oops" and require the crashed node to be reset. The hangs are a little more complicated: both nodes appear to "freeze" the GFS filesystem, and any GFS-related activity (gfs_tool df, umounts, etc.) just hangs. I have been unable to find a clean way to recover from this situation - attempts to umount the filesystem just cause the umount to hang. The only way I have found to deal with it is to take down one node's ethernet interface, so that the other node notices it is no longer receiving heartbeats, fences it, and then continues on without any indication of a problem.

I am using the "RHEL4 cluster" branch from CVS and the 2.6.9-5.0.5.ELsmp kernel. I am using lock_dlm locking, and the filesystem was created via:

gfs_mkfs -r 1536 -j 3 -p lock_dlm -t ftp:dds_space /dev/mapper/ftp_space-erc1

My cluster configuration is pretty simple: sanbox2 fencing with two nodes and the two-node option set (<cman two_node="1" expected_votes="1">).

I would greatly appreciate any advice folks have as to what I can do to fix this problem. From the list archives it appears that other folks are serving out GFS filesystems via NFS, so this should be possible, right? I have attached the relevant part of /var/log/messages for a crash. If any additional information would be helpful, please let me know and I will get it (the crashes/hangs are very repeatable!).
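In case it helps with reproduction, here is roughly how the share is set up and exercised. This is a sketch from memory - the mount point, export options, and bonnie++ arguments below are illustrative placeholders, not necessarily my exact values:

```shell
# On the active node: mount the GFS filesystem (lock_dlm was set at mkfs time)
mount -t gfs /dev/mapper/ftp_space-erc1 /export/dds_space

# Assumed /etc/exports entry on the active node; a fixed fsid is commonly
# used so NFS file handles stay valid across failover:
#   /export/dds_space  *(rw,sync,fsid=1)
exportfs -ra

# On an NFS client: mount the share and run the workload that triggers
# the crash/hang within an hour or so
mount -t nfs jin:/export/dds_space /mnt/dds_space
bonnie++ -d /mnt/dds_space -u nobody
```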
Thanks,
-Jay Cable

Here is the output from one of the crashes:

Jun 9 19:23:46 jin kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
Jun 9 19:28:06 jin kernel: Bad page state at prep_new_page (in process 'nfsd', page c159f4e0)
Jun 9 19:28:06 jin kernel: flags:0x20001020 mapping:f6a300e0 mapcount:0 count:2
Jun 9 19:28:06 jin kernel: Backtrace:
Jun 9 19:28:06 jin kernel: [<c013e669>] bad_page+0x58/0x89
Jun 9 19:28:06 jin kernel: [<c013e9ec>] prep_new_page+0x24/0x3a
Jun 9 19:28:06 jin kernel: [<c013eef8>] buffered_rmqueue+0x17d/0x1a5
Jun 9 19:28:06 jin kernel: [<c013efd4>] __alloc_pages+0xb4/0x298
Jun 9 19:28:06 jin kernel: [<c013baa2>] find_lock_page+0x96/0x9d
Jun 9 19:28:06 jin kernel: [<c013d16d>] generic_file_buffered_write+0x10d/0x47c
Jun 9 19:28:06 jin kernel: [<c013bac1>] find_or_create_page+0x18/0x72
Jun 9 19:28:06 jin kernel: [<c013b775>] wake_up_page+0x9/0x29
Jun 9 19:28:06 jin kernel: [<c013d85e>] generic_file_aio_write_nolock+0x382/0x3b0
Jun 9 19:28:06 jin kernel: [<c013d910>] generic_file_write_nolock+0x84/0x99
Jun 9 19:28:06 jin kernel: [<f8f96e5f>] gfs_glock_nq+0xe3/0x116 [gfs]
Jun 9 19:28:06 jin kernel: [<c011e8d2>] autoremove_wake_function+0x0/0x2d
Jun 9 19:28:06 jin kernel: [<f8fb7658>] gfs_trans_begin_i+0xfd/0x15a [gfs]
Jun 9 19:28:06 jin kernel: [<f8faadd2>] do_do_write_buf+0x268/0x3b4 [gfs]
Jun 9 19:28:06 jin kernel: [<f8fab02e>] do_write_buf+0x110/0x152 [gfs]
Jun 9 19:28:06 jin kernel: [<f8faa238>] walk_vm+0xd3/0xf7 [gfs]
Jun 9 19:28:06 jin kernel: [<f8f9709a>] gfs_glock_dq+0x111/0x11f [gfs]
Jun 9 19:28:06 jin kernel: [<f8fab10d>] gfs_write+0x9d/0xb6 [gfs]
Jun 9 19:28:06 jin kernel: [<f8faaf1e>] do_write_buf+0x0/0x152 [gfs]
Jun 9 19:28:06 jin kernel: [<f8fab070>] gfs_write+0x0/0xb6 [gfs]
Jun 9 19:28:06 jin kernel: [<c0155ba8>] do_readv_writev+0x1c5/0x21d
Jun 9 19:28:06 jin kernel: [<c0154c92>] dentry_open+0xf0/0x1a5
Jun 9 19:28:06 jin kernel: [<c0155c7e>] vfs_writev+0x3e/0x43
Jun 9 19:28:06 jin kernel: [<f8c11b6b>] nfsd_write+0xeb/0x289 [nfsd]
Jun 9 19:28:06 jin kernel: [<f8b2d5db>] svcauth_unix_accept+0x2d3/0x34a [sunrpc]
Jun 9 19:28:06 jin kernel: [<f8c18356>] nfsd3_proc_write+0xbf/0xd5 [nfsd]
Jun 9 19:28:06 jin kernel: [<f8c1a3a8>] nfs3svc_decode_writeargs+0x0/0x243 [nfsd]
Jun 9 19:28:06 jin kernel: [<f8c0e5d7>] nfsd_dispatch+0xba/0x16f [nfsd]
Jun 9 19:28:06 jin kernel: [<f8b2a446>] svc_process+0x420/0x6d6 [sunrpc]
Jun 9 19:28:06 jin kernel: [<f8c0e3b7>] nfsd+0x1cc/0x332 [nfsd]
Jun 9 19:28:06 jin kernel: [<f8c0e1eb>] nfsd+0x0/0x332 [nfsd]
Jun 9 19:28:06 jin kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Jun 9 19:28:06 jin kernel: Trying to fix it up, but a reboot is needed
Jun 9 19:30:34 jin kernel: ------------[ cut here ]------------
Jun 9 19:30:34 jin kernel: kernel BUG at mm/vmscan.c:377!
Jun 9 19:30:34 jin kernel: invalid operand: 0000 [#1]
Jun 9 19:30:34 jin kernel: SMP
Jun 9 19:30:34 jin kernel: Modules linked in: lock_dlm(U) dlm(U) cman(U) gfs(U) lock_harness(U) dm_mod qla2300 qla2xxx scsi_transport_fc nfsd exportfs lockd autofs4 i2c_dev i2c_core md5 ipv6 sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd raid1 ata_piix libata sd_mod scsi_mod
Jun 9 19:30:34 jin kernel: CPU: 1
Jun 9 19:30:34 jin kernel: EIP: 0060:[<c01447bd>] Tainted: GF B VLI
Jun 9 19:30:34 jin kernel: EFLAGS: 00010202 (2.6.9-5.0.5.ELsmp)
Jun 9 19:30:34 jin kernel: EIP is at shrink_list+0xa9/0x3ee
Jun 9 19:30:34 jin kernel: eax: 20001049 ebx: f7cedecc ecx: c159f4f8 edx: c10f24d8
Jun 9 19:30:34 jin kernel: esi: c159f4e0 edi: 00000021 ebp: f7cedf58 esp: f7cede54
Jun 9 19:30:34 jin kernel: ds: 007b es: 007b ss: 0068
Jun 9 19:30:34 jin kernel: Process kswapd0 (pid: 44, threadinfo=f7ced000 task=f7d1b7b0)
Jun 9 19:30:34 jin kernel: Stack: 00000001 00000000 00000000 00000000 f7cedecc f7cede68 f7cede68 00000000
Jun 9 19:30:34 jin kernel: 00000001 c12f4be0 c1204a00 00000246 f7ceded4 c0319e00 00000000 f7ceded4
Jun 9 19:30:34 jin kernel: c0143bc0 c10639f8
00000296 c1f479c0 c10639e0 00000000 00000020 f7ced000
Jun 9 19:30:34 jin kernel: Call Trace:
Jun 9 19:30:34 jin kernel: [<c0143bc0>] __pagevec_release+0x15/0x1d
Jun 9 19:30:34 jin kernel: [<c0144cdf>] shrink_cache+0x1dd/0x34d
Jun 9 19:30:34 jin kernel: [<c014539d>] shrink_zone+0xa7/0xb6
Jun 9 19:30:34 jin kernel: [<c0145740>] balance_pgdat+0x1b6/0x2f8
Jun 9 19:30:34 jin kernel: [<c014594c>] kswapd+0xca/0xcc
Jun 9 19:30:34 jin kernel: [<c011e8d2>] autoremove_wake_function+0x0/0x2d
Jun 9 19:30:34 jin kernel: [<c02c6206>] ret_from_fork+0x6/0x14
Jun 9 19:30:34 jin kernel: [<c011e8d2>] autoremove_wake_function+0x0/0x2d
Jun 9 19:30:34 jin kernel: [<c0145882>] kswapd+0x0/0xcc
Jun 9 19:30:34 jin kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Jun 9 19:30:34 jin kernel: Code: 71 e8 89 50 04 89 02 c7 41 04 00 02 20 00 c7 01 00 01 10 00 f0 0f ba 69 e8 00 19 c0 85 c0 0f 85 b8 02 00 00 8b 41 e8 a8 40 74 08 <0f> 0b 79 01 41 9a 2d c0 8b 41 e8 f6 c4 20 0f 85 96 02 00 00 8b

Here is my cluster.conf:

<?xml version="1.0"?>
<cluster name="ftp" config_version="1">
  <cman two_node="1" expected_votes="1">
  </cman>
  <clusternodes>
    <clusternode name="jin-p">
      <fence>
        <method name="single">
          <device name="sanbox2" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="mugen-p">
      <fence>
        <method name="single">
          <device name="sanbox1" port="1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="sanbox1" agent="fence_sanbox2" ipaddr="10.0.19.30" login="admin" passwd="p00-sm3llz"/>
    <fencedevice name="sanbox2" agent="fence_sanbox2" ipaddr="10.0.19.31" login="admin" passwd="p00-sm3llz"/>
  </fencedevices>
  <fence_daemon post_join_delay="20">
  </fence_daemon>
</cluster>

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster