Thanks Steve,

Yes, sadly I can confirm the file system has been corrupted, but I still don't understand why I/O stops flowing at the LVM level (and doesn't fence it either) and why fsck keeps crashing without a useful error message. Is there any signal I can send to gfs_fsck to bypass certain stages?

Also, to speed up the fsck process, I was thinking of making use of the RAM and increasing the read-ahead parameter (hdparm -a) of the PV device (an AoE device) by 1GB, since that should hugely optimize sequential reads, and fscking is mostly a sequential read process with very little writing. What do you think?

Here is the tail of the last fsck log file:

(metawalk.c:516) Extended attributes exist for inode #34020861.
(metawalk.c:413) Checking EA leaf block #34020862.
(pass1.c:485) Setting block #34020862 to eattr block
(pass1.c:907) Checking metadata block 34020862
(pass1.c:923) Metadata block 34020862 not an inode or free metadata
(pass1.c:907) Checking metadata block 34020863
(link.c:22) Setting link count to 1 for 34020863
(metawalk.c:516) Extended attributes exist for inode #34020863.
(metawalk.c:413) Checking EA leaf block #34020864.
(pass1.c:485) Setting block #34020864 to eattr block
(pass1.c:907) Checking metadata block 34020864
(pass1.c:923) Metadata block 34020864 not an inode or free metadata
(pass1.c:907) Checking metadata block 34020865
(link.c:22) Setting link count to 1 for 34020865
(pass1.c:213) Setting 34020917 to data block
(pass1.c:213) Setting 34020918 to data block
(pass1.c:213) Setting 34020919 to data block
(pass1.c:213) Setting 34020920 to data block
(metawalk.c:516) Extended attributes exist for inode #34020865.
(metawalk.c:413) Checking EA leaf block #34020866.
(pass1.c:485) Setting block #34020866 to eattr block
(pass1.c:907) Checking metadata block 34020866
(pass1.c:923) Metadata block 34020866 not an inode or free metadata
(pass1.c:907) Checking metadata block 34020867

Thanks,
-- Abraham

On 6/07/2010, at 8:22 PM, Steven Whitehouse wrote:

> Hi,
>
> It looks to me as if the fs is corrupt in some manner. Try unmounting on
> all nodes and running fsck on one node on the filesystem. Make sure you
> save the output of fsck in case that is useful for future debugging and
> make sure you have a backup of the data in question first.
>
> It's tricky to say exactly what might have gone wrong (the fsck output
> might give a clue) but you will certainly need fsck to fix whatever the
> problem is,
>
> Steve.
>
> On Tue, 2010-07-06 at 13:22 +1200, Abraham Alawi wrote:
>> The system was running well for a while, but lately we had a flaky disk in the RAID array, which we replaced with a healthy one. Suddenly the CLVM/GFS became unusable: we can mount GFS, but listing it recursively ('ls -R') hangs with an Input/output error. We can't even access the c/LVM LUN directly using 'dd', BUT we can still access the LVM PV devices using 'dd'. Reconfiguring the LVM volume as a local one and accessing it exclusively from one node doesn't make a difference.
>>
>> RHEL5: 2.6.18-164.11.1.el5
>> # modinfo gfs
>> filename: /lib/modules/2.6.18-164.11.1.el5/weak-updates/gfs/gfs.ko
>> license: GPL
>> author: Red Hat, Inc.
>> description: Global File System 0.1.34-2.el5
>> srcversion: 3B1BAC4069F1A4B556A958A
>> depends: dlm
>> vermagic: 2.6.18-159.el5 SMP mod_unload gcc-4.1
>>
>> # uname -r
>> 2.6.18-164.11.1.el5
>>
>> # modinfo /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
>> filename: /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
>> description: AoE block/char driver for 2.6.2 and newer 2.6 kernels
>> author: Sam Hopkins <sah@xxxxxxxxxx>
>> license: GPL
>> srcversion: 42BF122979AC807F2BB50E6
>> depends:
>> vermagic: 2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
>> parm: aoe_iflist:aoe_iflist=dev1[,dev2...]
>> (string)
>> parm: version:aoe module version 74
>> (string)
>> parm: aoe_dyndevs:Use dynamic minor numbers for devices. (int)
>> parm: aoe_deadsecs:After aoe_deadsecs seconds, give up and fail dev. (int)
>> parm: aoe_maxout:Only aoe_maxout outstanding packets for every MAC on eX.Y. (int)
>> parm: aoe_maxsectors:When nonzero, set the maximum number of sectors per I/O request in new devices. (int)
>>
>> # modinfo dlm
>> filename: /lib/modules/2.6.18-164.11.1.el5/kernel/fs/dlm/dlm.ko
>> license: GPL
>> author: Red Hat, Inc.
>> description: Distributed Lock Manager
>> srcversion: E768995007648CA8DB078AE
>> depends: configfs
>> vermagic: 2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
>> module_sig: 883f3504b56fe19c59c69348c13cf1f1126a509f6ddaee3965ee8b5fcd04163669647a889a9801e09f722187d1de068c0d52cd2b99bc3d475cb6ca1a0
>>
>>
>>
>> Here is what the kernel spits out:
>>
>> Jul 6 11:27:36 kiwiland kernel: GFS 0.1.34-2.el5 (built Sep 9 2009 06:54:42) installed
>> Jul 6 11:27:36 kiwiland kernel: Lock_DLM (built Sep 9 2009 06:54:38) installed
>> Jul 6 11:27:36 kiwiland kernel: Lock_Nolock (built Sep 9 2009 06:54:37) installed
>> Jul 6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:files"
>> Jul 6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Trying to acquire journal lock...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Looking at journal...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Acquiring the transaction lock...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replaying journal...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replayed 0 of 11 blocks
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: replays = 0, skips = 4, sames = 7
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Journal replayed in 1s
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Done
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Trying to acquire journal lock...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Looking at journal...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Done
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Scanning for log elements...
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found 2 unlinked inodes
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found quota changes for 2 IDs
>> Jul 6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Done
>> Jul 6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:webcluster"
>> Jul 6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Trying to acquire journal lock...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Looking at journal...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Done
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Scanning for log elements...
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found 0 unlinked inodes
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found quota changes for 0 IDs
>> Jul 6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Done
>> Jul 6 11:27:37 kiwiland kernel: Installing knfsd (copyright (C) 1996 okir@xxxxxxxxxxxx).
>> Jul 6 11:27:39 kiwiland kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>> Jul 6 11:27:39 kiwiland kernel: NFSD: starting 90-second grace period
>> Jul 6 11:32:21 kiwiland kernel: dlm: closing connection to node 1
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Trying to acquire journal lock...
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: bh = 1432543247 (magic)
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: function = gfs_rgrp_read
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/rgrp.c, line = 830
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: time = 1278372781
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
>> Jul 6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Looking at journal...
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Acquiring the transaction lock...
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replaying journal...
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replayed 0 of 0 blocks
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: replays = 0, skips = 0, sames = 0
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Journal replayed in 1s
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Done
>> Jul 6 11:33:02 kiwiland kernel: GFS: fsid=FSC:files.0: withdrawn
>> Jul 6 11:33:02 kiwiland kernel:
>> Jul 6 11:33:02 kiwiland kernel: Call Trace:
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff88805018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff80063a36>] __wait_on_bit+0x60/0x6e
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8001538b>] sync_buffer+0x0/0x3f
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8881cc97>] :gfs:gfs_meta_check_ii+0x32/0x3e
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff88819439>] :gfs:gfs_rgrp_read+0x139/0x225
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff887fb8e8>] :gfs:glock_wait_internal+0x229/0x2c3
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff887fbd17>] :gfs:gfs_glock_nq+0x395/0x3d6
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff887fbd6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff88817466>] :gfs:gfs_rgrp_lvb_init+0x1e/0x3f
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8881a46f>] :gfs:gfs_stat_gfs+0x213/0x273
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8881353d>] :gfs:gfs_statfs+0x67/0xea
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff800deba3>] vfs_statfs+0x63/0x7f
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886d2ce>] :nfsd:nfsd_statfs+0x28/0x38
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff888745f8>] :nfsd:nfsd3_proc_fsstat+0x3f/0x54
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff886e0529>] :sunrpc:svc_process+0x454/0x71b
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff80064644>] __down_read+0x12/0x92
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a746>] :nfsd:nfsd+0x1a5/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul 6 11:33:02 kiwiland kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> Jul 6 11:33:02 kiwiland kernel:
>>
>>
>> Another kernel spit out:
>> Jul 5 02:01:19 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278252079
>> Jul 5 03:01:16 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278255676
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: bh = 86700288 (magic)
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: function = gfs_get_meta_buffer
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/dio.c, line = 1225
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: time = 1278255737
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
>> Jul 5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
>> Jul 5 03:02:21 Hercules kernel: GFS: fsid=FSC:files.0: withdrawn
>> Jul 5 03:02:21 Hercules kernel:
>> Jul 5 03:02:21 Hercules kernel: Call Trace:
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8880a018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8001538b>] sync_buffer+0x0/0x3f
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88821c97>] :gfs:gfs_meta_check_ii+0x32/0x3e
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff887f7717>] :gfs:gfs_get_meta_buffer+0x1d1/0x247
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88804193>] :gfs:gfs_copyin_dinode+0x1d/0x12f
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88800d6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888043e3>] :gfs:inode_create+0x13e/0x1df
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88804a5d>] :gfs:gfs_inode_get+0x9d/0xba
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888053bb>] :gfs:gfs_lookupi+0x33d/0x3df
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff887fce57>] :gfs:ea_find_i+0x0/0x6b
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888172af>] :gfs:gfs_lookup+0x363/0x41a
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff80025426>] igrab+0x25/0x34
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff888055a0>] :gfs:gfs_iget+0x3d/0x1f1
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff88801224>] :gfs:gfs_glock_dq+0x13c/0x14b
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000cf01>] do_lookup+0xe5/0x1e6
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000a22b>] __link_path_walk+0xa01/0xf42
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000e9cc>] link_path_walk+0x42/0xb2
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8000cc9c>] do_path_lookup+0x275/0x2f1
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff80012752>] getname+0x15b/0x1c2
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff800236ba>] __user_walk_fd+0x37/0x4c
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8003f235>] vfs_lstat_fd+0x18/0x47
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8002a95a>] sys_newlstat+0x19/0x31
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
>> Jul 5 03:02:21 Hercules kernel: [<ffffffff8005d116>] system_call+0x7e/0x83
>>
>>
>> Thanks in advance,
>>
>> -- Abraham
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>> Abraham Alawi
>>
>> Unix/Linux Systems Administrator
>> Science IT
>> University of Auckland
>> e: a.alawi@xxxxxxxxxxxxxx
>> p: +64-9-373 7599, ext#: 87572
>>
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster@xxxxxxxxxx
>> https://www.redhat.com/mailman/listinfo/linux-cluster
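
A note on the read-ahead idea at the top of this message: hdparm -a is documented as taking a count of sectors rather than a byte size, so a 1GB window has to be converted before it is passed in. The sketch below only does that arithmetic. It assumes 512-byte sectors and reads "1GB" as 1 GiB; the helper name `readahead_sectors` is made up for illustration, and whether the kernel or the AoE driver would honor a window that large is not something this shows.

```python
# Convert a desired read-ahead size in bytes into a sector count.
# Assumption: the read-ahead setting is expressed in 512-byte sectors.
SECTOR_BYTES = 512
GIB = 1024 ** 3


def readahead_sectors(size_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Return how many whole sectors fit in size_bytes (hypothetical helper)."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if sector_bytes <= 0:
        raise ValueError("sector_bytes must be positive")
    return size_bytes // sector_bytes


if __name__ == "__main__":
    # Sector count corresponding to a 1 GiB read-ahead window.
    print(readahead_sectors(GIB))
```

The printed number is the value that would go after -a for the PV device, if a window of that size is accepted at all.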