Hello Bob,

First of all, thanks for the detailed answer.

On Mon, Oct 08, 2007 at 03:26:56PM -0500, Bob Peterson wrote:
> This is odd. What it means is this: GFS was searching for a free block
> to allocate. The resource group ("RG"--not to be confused with
> rgmanager's resource groups) indicated at least one free block for that
> section of the file system, but there were no free blocks to be found in
> the bitmap for that section (a direct contradiction). Therefore, the
> file system was determined to be corrupt.

What explains the files in l+f, then? The files were created a day before
the crash, they disappeared some time later and were recreated, but I was
not informed about this before the crash happened.

In this particular setup, the "storage" is a two-node heartbeat (2.1.2)
failover cluster exporting a drbd 8.2.0 device via iSCSI (iscsitarget
0.4.15). The systems are RHEL5 with the stock 2.6.18-8.1.14.el5 kernel;
drbd and heartbeat are self-compiled. The iSCSI target is exported without
data and header digests; I will switch it to crc32c now to rule out the
network.

> (3) Another possibility is a hardware problem
> with your media--the hard drives, FC switch, HBAs, etc. This could
> happen, for example, if GFS read the bitmap(s) from disk and the disk
> returned the wrong information. We've seen a lot of that, and the best
> thing to do is test the media (but it's a tedious and sometimes
> destructive task).

I wonder if a drbd failover might cause something like this. When manually
switching the resources to the backup node, the drbd share is made
secondary, the iscsi-target is stopped and the service IP removed; on the
other node the IP is activated, the drbd device made primary and the
iscsitarget started. The clients issue a

  connection0:0: iscsi: detected conn error (1011)

and everything continues after the switch, which takes 4-5 seconds. OTOH,
there has been no failover since before the GFS filesystem was formatted
and populated, only some scheduled node reboots.

I also wonder if broken memory in one of the nodes or the iSCSI servers
could be to blame, but I did not see segfaults or MCEs anywhere. GFS does
direct I/O, and we had iscsitarget set to blockio, which is uncached
direct I/O too.

> If you had not run gfs_fsck, we might have been able to tell a little
> bit more about what happened from the contents of the journals.
> For example, in RHEL5 and equivalent, you can use gfs2_edit to save
> off the file system metadata and send it in for analysis.
> (gfs2_edit can operate on gfs1 file systems as well as gfs2). However,
> since gfs_fsck clears the journals, that information is now long gone.

Oh, thanks for the hint, I'll do this in case this happens again.

Best regards
Frederik Schüler

--
ENOSIG
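
P.S.: For the archives, this is roughly what the two changes mentioned above
would look like on our setup; the target name and device paths below are
only examples, not our real configuration:

  # /etc/ietd.conf on the iscsi servers: enable CRC32C digests
  # (the initiators probably also want matching HeaderDigest/DataDigest
  # settings in their iscsid.conf so the digests actually get negotiated)
  Target iqn.2007-10.example.org:gfs-storage
      Lun 0 Path=/dev/drbd0,Type=blockio
      HeaderDigest CRC32C
      DataDigest CRC32C

  # and, should the corruption ever show up again, save the GFS metadata
  # (including the journals) *before* running gfs_fsck:
  gfs2_edit savemeta /dev/mapper/storage-gfslv /tmp/gfs-metadata.meta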