Sorry for the slow response. We ended up (finally) purchasing some RHEL licenses to try to get some phone support for this problem, and came up with a plan to salvage what we could. I'll try to offer a brief history of the problem in the hope that you can help me understand this issue a little better.
I've posted the relevant logfile entries for the events described here:
http://kai.gnukai.com/gfs2_meltdown.txt
All the nodes send syslog to a remote server named pxe, so the combined
syslog for all the nodes plus the syslog server is here:
http://kai.gnukai.com/messages.txt
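In case it matters, the remote logging is just stock sysklogd; each node has something like the following in /etc/syslog.conf (and syslogd on pxe runs with -r so it accepts remote messages) -- the hostname is ours, the rest is the usual syntax:

  # forward a copy of everything to the central syslog host
  *.*     @pxe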
We started with a 4-node cluster (nodes 1, 2, 4, 5). The GFS2 filesystem was created with the latest packages CentOS 5.3 had to offer when it was released. Node 3 was off at the time the errors occurred, and not part of the cluster.
The first issue I can recover from syslog is from node 5 (192.168.100.105) on Sep 8 14:11:27: a 'fatal: invalid metadata block' error that resulted in the filesystem being withdrawn.
Next, node 4 (192.168.100.104) hit a 'fatal: filesystem consistency error' that also resulted in the filesystem being withdrawn. On the systems themselves, any attempt to access the filesystem resulted in an I/O error. Facing the prospect of rebooting 2 of the 4 nodes in my cluster, I brought node 3 (192.168.100.103) online first. Then I power cycled nodes 4 and 5 one at a time and let them come back online. These nodes run Xen, so after all 3 had joined the cluster I started bringing the VMs that had been on nodes 4 and 5 back online on nodes 3-5.
Shortly thereafter, node 3 encountered the 'fatal: invalid metadata block' error and withdrew the filesystem. Then node 2 (.102) encountered 'fatal: invalid metadata block' as well and withdrew the filesystem, so I rebooted them both.
During their reboot, nodes 1 (.101) and 5 hit the same 'fatal: invalid metadata block' error. I waited for nodes 2 and 3 to come back online to preserve the cluster. At this point, node 4 was the only node that still had the filesystem mounted. After I had rebooted the other 4 nodes, none of them could mount the filesystem after joining the cluster, and node 4 was spinning on this error:
Sep 8 16:54:22 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.0: jid=4: Trying to acquire journal lock...
Sep 8 16:54:22 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.0: jid=4: Busy
It wasn't until this point that we suspected the SAN. We discovered that
the SAN had marked a drive as "failed" but did not remove it from the
array and begin to rebuild on the hot spare. When we physically removed
the failed drive, the hot spare was picked up and put into the array.
The VMs on node 4 were the only ones "running", but they had all crashed because their disk was unavailable. I decided to reboot all the nodes to try to re-establish the cluster. We were able to get all the VMs turned back on, and we thought we were out of the woods, with the exception of the heavy filesystem corruption we had caused inside 30% of the VMs' filesystems. We ran those through their ext3 filesystem checks and got them all running again.
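The checks themselves were nothing exotic; roughly the standard approach of forcing a full check on the next boot inside each guest, along these lines (the non-root device name is just an example):

  # inside a guest: force a full e2fsck of the ext3 root on the next boot
  touch /forcefsck
  reboot

  # non-root ext3 volumes were checked directly while unmounted
  e2fsck -f -y /dev/xvdb1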
Then, at the time I sent the original email, we were encountering the same invalid metadata block errors on the VMs at different points.
With Red Hat on the phone, we decided to migrate as much data as we could
from the original production SAN to a new SAN, and bring the VMs online
on the new SAN. There were a total of 3 VM disk images that would not
copy because they would trigger the invalid metadata block error every
time. After the migration, we tried 3 filesystem checks, all of which
failed, leaving the fsck_dlm mechanism configured on the filesystem. We
were able to override the lock with the instructions here:
http://kbase.redhat.com/faq/docs/DOC-17402
We were able to remount the GFS2 filesystem, but without any improvement; if we tried to copy those same files, it would withdraw the filesystem. We were able to mount the disk images and recover some files, and we feel like we got lucky that none of those files also triggered a filesystem withdraw.
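For the curious, recovering files from the images was done from dom0 with loop devices and kpartx, roughly like this (paths and device names are just examples):

  # map the partitions inside a guest disk image and mount read-only
  losetup /dev/loop0 /gfs/vms/guest01.img
  kpartx -av /dev/loop0
  mount -o ro /dev/mapper/loop0p1 /mnt/recover
  # ... copy out whatever is still readable ...
  umount /mnt/recover
  kpartx -d /dev/loop0
  losetup -d /dev/loop0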
Today, we feel fairly confident that the underlying issue began with the
disk being marked as failed, but not being removed from the array. We've
contacted the hardware vendor of the SAN, but the only response they
offered was, "That shouldn't have happened, and it shouldn't happen again."
I am very interested in your response, but at this point there isn't any urgency. The old production SAN is in the lab, and we are running SMART tests on each of the disks to see if any of them are salvageable. The new SAN we put into production has all new drives (different make and model), and we hope we don't encounter any further issues.
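The SMART testing is just smartctl run against each disk in turn, something like this (device names are examples):

  # start a long self-test, then come back later for the results
  smartctl -t long /dev/sda
  smartctl -l selftest /dev/sda
  smartctl -H -a /dev/sda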
The last issue I want to investigate is phasing out all my CentOS 5.3 servers and installing RHEL 5.4 servers one at a time, so I don't have to take down the cluster. The intent is to live-migrate all the VMs to the RHEL 5.4 servers to keep VM availability as high as possible, but that's another topic I'll probably post about later, after some testing in the lab.
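For reference, the plan is just the normal xm live migration (guest and host names below are examples; xend relocation has to be enabled in /etc/xen/xend-config.sxp on the receiving node):

  # on an old CentOS 5.3 node: push a running guest to a new RHEL 5.4 node
  xm migrate --live guest01 newnode1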
-Kai Meyer
Steven Whitehouse wrote:
Hi,
On Sat, 2009-09-19 at 05:16 -0600, Kai Meyer wrote:
I have a 5-node cluster running kernel 2.6.18-128.1.6.el5xen and gfs2-utils-0.1.53-1.el5_3.3. Twice in 10 days, each node in my cluster
has failed with the same message in /var/log/messages. dmesg reports the
same errors, and on some nodes there are no other entries previous to
the invalid metadata block error.
I would like to know what issues can trigger such an event. If it is more helpful for me to provide more information, I will be happy to; I'm just not sure what other information you would consider relevant.
Thank you for your time,
-Kai Meyer
It means that the kernel was looking for an indirect block, but instead
found something that was not an indirect block. The only way to fix this is with fsck (after unmounting on all nodes); otherwise the issue is likely to continue to occur each time you access the particular inode with the problem.
There have been a couple of reports of this (or a very similar) issue recently. The problem in each case is that the original issue probably
happened some time before it triggered the message which you've seen.
That means that it is very tricky to figure out exactly what the cause
is.
I'd be very interested to know whether this filesystem was a newly
created gfs2 filesystem or an upgraded gfs1 filesystem. Also, whether
there have been any other issues, however minor, which might have caused
a node to be rebooted or fenced since the filesystem was created? Also,
any other background information about the type of workload that was
being run on the filesystem would be helpful too.
Steve.
Sep 19 02:02:06 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.1: fatal: invalid metadata block
Sep 19 02:02:06 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.1: bh = 567447963 (magic number)
Sep 19 02:02:06 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.1: function =
gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line
= 334
Sep 19 02:02:06 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.1: about to withdraw this file system
Sep 19 02:02:06 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.1: telling LM to withdraw
Sep 19 02:02:07 192.168.100.104 kernel: GFS2:
fsid=xencluster1:xenclusterfs1.1: withdrawn
Sep 19 02:02:07 192.168.100.104 kernel:
Sep 19 02:02:07 192.168.100.104 kernel: Call Trace:
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff885154ce>]
:gfs2:gfs2_lm_withdraw+0xc1/0xd0
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff80262907>]
__wait_on_bit+0x60/0x6e
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff80215788>]
sync_buffer+0x0/0x3f
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff80262981>]
out_of_line_wait_on_bit+0x6c/0x78
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8029a01a>]
wake_bit_function+0x0/0x23
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8021a7f1>]
submit_bh+0x10a/0x111
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff885284a7>]
:gfs2:gfs2_meta_check_ii+0x2c/0x38
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff88518d30>]
:gfs2:gfs2_meta_indirect_buffer+0x104/0x160
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff88509fc3>]
:gfs2:gfs2_block_map+0x1dc/0x33e
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8021a821>]
poll_freewait+0x29/0x6a
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8850a199>]
:gfs2:gfs2_extent_map+0x74/0xac
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8850a2ce>]
:gfs2:gfs2_write_alloc_required+0xfd/0x122
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff885128d5>]
:gfs2:gfs2_glock_nq+0x248/0x273
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8851a27c>]
:gfs2:gfs2_write_begin+0x99/0x36a
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8851bd1b>]
:gfs2:gfs2_file_buffered_write+0x14b/0x2e5
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8020d3a5>]
file_read_actor+0x0/0xfc
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8851c151>]
:gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8851c2f4>]
:gfs2:gfs2_file_write_nolock+0xaa/0x10f
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8022eca0>]
__wake_up+0x38/0x4f
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff80299fec>]
autoremove_wake_function+0x0/0x2e
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8022fbe4>]
pipe_readv+0x38e/0x3a2
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff80263bce>]
lock_kernel+0x1b/0x32
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8851c444>]
:gfs2:gfs2_file_write+0x49/0xa7
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff80216da9>]
vfs_write+0xce/0x174
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff802175e1>]
sys_write+0x45/0x6e
Sep 19 02:02:07 192.168.100.104 kernel: [<ffffffff8025f2f9>]
tracesys+0xab/0xb6
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster