On Wed, 2007-09-26 at 10:33 +0200, Borgström Jonas wrote:
> Hi again,
>
> After stress testing a gfs filesystem for 24 hours, fsck.gfs complains about "Found unlinked inode".
> This scared me, so I reran the test and got the same result.
>
> My test consists of two nodes running bonnie++, postgresql and pgbench against a single file system. Every five minutes one of the nodes is shot.
>
> The weird part is that on both occasions the thing fsck.gfs complained about was an "unlinked inode" corresponding to a postgresql pid file. This is a file created (and deleted) every time postgresql is failed over to another node. It is also the last file on the filesystem to be deleted when postgresql was shut down, before the filesystem was unmounted and fsck.gfs was run.
>
> Can anybody explain why this pid file triggered this fsck error twice, and not any of the thousands of files created and deleted by bonnie++?
>
> Does this mean the filesystem is corrupt, or is this expected behavior for files deleted directly before a filesystem is unmounted?
>
> BTW: I'm not able to reproduce this by simply mounting the filesystem, starting/stopping pgsql and unmounting. I need to leave the test running overnight. I've also performed some tests directly on the SAN device, and as far as I can tell it's working as expected.
>
> OS: RHEL5 Advanced Platform

Hi Jonas,

Well, I can think of one possible explanation. I can't be sure because I don't know your test scenario, but this is my theory.

First, a bit of background: when a node gets "shot", as you say, the metadata for some of its recent operations is likely to still be in the journal it was using. Depending on the timing, that metadata may exist only in the journal, if the node was shot before the data was written to its final destination on disk.
Ordinarily that's not a big deal, because the next time the file system is mounted the journal is replayed, which causes the metadata to be written correctly to its proper place on disk, and all is well. That's the same for most journaling file systems, afaik.

A couple of years ago, one of my predecessors (before I started) made an executive decision to make gfs_fsck *clear* the system journals rather than replay them. I don't know offhand if the replay code was once there and got taken out, or if it was never written. At any rate, it seemed like a good idea at the time, and there were several good reasons to justify that decision:

First, if the user is running gfs_fsck, they must already suspect file system corruption. If (and this is a big if) that corruption was caused by recent operations on the file system, then replaying the journal can only compound the corruption, because what is in the journal may itself be based on the corruption. This was more of a concern if, for some reason, GFS had bailed out and "withdrawn" from the file system because it detected corruption, suspecting that it had somehow caused that corruption.

Second, if the user is running gfs_fsck because of corruption, we may not be able to assume that the journal contains good metadata worthy of being replayed.

Third, the user always has the option of replaying the journal before doing gfs_fsck:

1. mount the file system after the crash (to replay the journal)
2. unmount the file system
3. run gfs_fsck

The decision to have gfs_fsck clear the journals was probably made many years ago, before gfs was stable and when these "withdraw" situations were more common. Some people believe that this was a bad decision.
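In case it helps, the manual replay-before-fsck sequence above might look something like this (the device and mount point names here are made up for illustration; substitute your own):

```shell
# Replay the dirty journal by mounting, then fsck the now-clean filesystem.
# /dev/my_vg/gfs_lv and /mnt/gfs are placeholder names for this example.
mount -t gfs /dev/my_vg/gfs_lv /mnt/gfs   # journal recovery happens at mount time
umount /mnt/gfs                           # filesystem is now quiescent on disk
gfs_fsck /dev/my_vg/gfs_lv                # journals are already replayed, so
                                          # nothing is lost when fsck clears them
```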
I believe it makes more sense to trust the journal and replay it before doing the rest of the fsck operations, because in "normal" cases where a node dies (often for some reason unrelated to gfs, like getting shot, fenced, losing power, blowing up a power supply, etc.) you have the potential to lose metadata unless the journal is replayed. Other journaling file systems replay their journals during fsck, or else they inform the user, ask them to take steps to replay the journal (as above), give them the option to clear it, and so on. So far, gfs_fsck does not do that. It just clears the journals.

To remedy the situation, I've got an open bugzilla, 291551 (which may be marked "private" because it was opened internally--sorry), at least for the gfs2_fsck case (gfs_fsck will likely be done too). With that bugzilla, I intend to either ask the user whether they want the journals replayed, replay them automatically, or try to detect problems with them.

I'm not certain that this is the cause of your corruption, but it's the only one I can think of at the moment.

Regards,

Bob Peterson
Red Hat Cluster Suite

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster