On Mon, 2007-10-08 at 21:33 +0200, Frederik Schueler wrote:
> Hello,
>
> I just got a crash on a gfs share:
>
> GFS: fsid=beta:helium.1: fatal: assertion "x <= length" failed
> GFS: fsid=beta:helium.1: function = blkalloc_internal
> GFS: fsid=beta:helium.1: file = /usr/src/modules/redhat-cluster/gfs/gfs/rgrp.c, line = 1458
> GFS: fsid=beta:helium.1: time = 1191842568
> GFS: fsid=beta:helium.1: about to withdraw from the cluster
> GFS: fsid=beta:helium.1: waiting for outstanding I/O
> GFS: fsid=beta:helium.1: telling LM to withdraw
> lock_dlm: withdraw abandoned memory
> GFS: fsid=beta:helium.1: withdrawn
>
> The system is running gfs 1.04 with linux 2.6.21.
>
> After the crash, I rebooted the affected node and ran an fsck from another node to check the file system in question, and now it has a dozen lost files in l+f.
>
> How can I debug the issue?
>
> Best regards
> Frederik Schüler

Hi Frederik,

This is odd.  What it means is this: GFS was searching for a free block to allocate.  The resource group ("RG"--not to be confused with rgmanager's resource groups) indicated at least one free block for that section of the file system, but there were no free blocks to be found in the bitmap for that section (a direct contradiction).  Therefore, the file system was determined to be corrupt.  (A simplified sketch of the kind of check that tripped appears at the end of this message.)

It's nearly impossible to say how this could have happened.  Here are a few possibilities:

(1) It's possible that some rogue kernel module overwrote the bitmap memory.

(2) This can also happen if gfs_fsck is run on a file system that is still mounted on another node.

(3) Another possibility is a hardware problem with your media--the hard drives, FC switch, HBAs, etc.  This could happen, for example, if GFS read the bitmap(s) from disk and the disk returned the wrong information.  We've seen a lot of that, and the best thing to do is test the media (but it's a tedious and sometimes destructive task).  For more information on that, see:
http://sources.redhat.com/cluster/faq.html#gfs_corruption

(4) It's also possible--although unlikely--that this is a GFS bug, but as far as I know, you're the only person to report such a thing.

If it really is a GFS bug, the best way to solve it (and sometimes the only way) is to give us a way to recreate the corruption, starting from a clean file system, using a recreation program.  If that's not possible, you could describe what was happening to the file system at the time of the failure, in as much detail as possible, and we can do some experiments here.  For example: were there lots of file renames going on?  Directory renames?  File creates?  What kind of I/O was happening to the file system at the time?  But doing these experiments is often just a waste of time.

If you had not run gfs_fsck, we might have been able to tell a little more about what happened from the contents of the journals.  For example, in RHEL5 and equivalent, you can use gfs2_edit to save off the file system metadata and send it in for analysis; gfs2_edit can operate on gfs1 file systems as well as gfs2, and an example invocation is given at the end of this message.  However, since gfs_fsck clears the journals, that information is now long gone.

I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
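For what it's worth, here is a rough idea of the sanity check that tripped.  This is only a simplified sketch, not the actual blkalloc_internal()/rgrp.c code; the names in it (count_free, BLKST_FREE, rg_free_from_header) are made up for illustration.  The only details taken from GFS are that each RG carries a free-block count and a bitmap holding two bits of state per block.

/*
 * Simplified illustration only -- NOT the real rgrp.c code.
 * Each RG has a header with a free-block count, plus a bitmap with
 * two bits of state per block.  All names here are hypothetical.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define BLKST_FREE 0u  /* two-bit block state meaning "free" (illustrative) */

/* Count blocks marked free in a two-bits-per-block bitmap. */
static uint32_t count_free(const uint8_t *bitmap, uint32_t nblocks)
{
    uint32_t n = 0;
    for (uint32_t b = 0; b < nblocks; b++)
        if (((bitmap[b / 4] >> ((b % 4) * 2)) & 0x3u) == BLKST_FREE)
            n++;
    return n;
}

int main(void)
{
    uint8_t bitmap[2] = { 0x55, 0x55 };  /* 8 blocks, every 2-bit state = "in use" */
    uint32_t rg_free_from_header = 1;    /* ...but the RG header claims one free block */

    printf("header says %u free, bitmap says %u free\n",
           (unsigned)rg_free_from_header,
           (unsigned)count_free(bitmap, 8));

    /*
     * This is the contradiction: the allocator trusts the header, searches
     * the bitmap for a free block, runs off the end, and the "x <= length"
     * style assertion fires.  Here it simply aborts; GFS withdraws instead.
     */
    assert(rg_free_from_header == 0 || count_free(bitmap, 8) > 0);
    return 0;
}

Run as-is, it prints the two counts and then aborts on the assert, which is essentially what the withdraw message on your node was telling you.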
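Also, if this ever happens again, grab the metadata before running gfs_fsck.  On a RHEL5-era box the gfs2_edit invocation looks roughly like the line below; the device path and output file name are just placeholders, so check the gfs2_edit man page on your release for the exact syntax:

  gfs2_edit savemeta /dev/your_vg/your_gfs_lv /tmp/helium.metadata

That should give us a compact copy of the file system's metadata, journals included, that you can attach to a bug report for analysis.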