We have several large ext3 file system partitions. One of them remounts
itself read-only after hitting journal problems. I understand that's
a good thing, but obviously I need to correct the underlying problem so
that it will stop locking itself. Here are some details:
The OS is Red Hat EL4 x86_64 running on a SunFire v40z; the kernel is
2.6.9-42.0.2.ELsmp. The disk storage in question is external, attached
via fiber cable. The fiber HBA is a QLogic ISP2312 connected to a QLogic
SAN switch, which is connected to four Apple Xserve RAIDs. There are 8
individual LUNs coming from the four Xserve RAIDs; they appear on the
host as /dev/sd[cdefghij]. Those LUNs are put into two LVM volume groups
and then mounted from logical volumes.
The partition in question is 8TB, about 92% full at the moment. One
oddity about this partition is that it has a subdirectory containing
over 2700 symbolic links to other partitions. Here is the output from
/var/adm/messages the last time the file system locked itself:
Jul 17 09:01:06 kernel: Info fld=0x0, Current sdd: sense key No Sense
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856796
Jul 17 09:01:06 kernel: Aborting journal on device dm-3.
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in start_transaction: Readonly filesystem
Jul 17 09:01:06 kernel: Aborting journal on device dm-3.
Jul 17 09:01:06 kernel: ext3_abort called.
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Jul 17 09:01:06 kernel: Remounting filesystem read-only
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in start_transaction: Journal has aborted
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856797
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856798
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856799
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856800
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3) in ext3_truncate: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_orphan_del: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:07 kernel: EXT3-fs error (device dm-3) in ext3_delete_inode: Journal has aborted
Jul 17 09:01:07 kernel: __journal_remove_journal_head: freeing b_committed_data
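To make the log above easier to act on, here is a minimal Python sketch
that pulls the block numbers out of the "bit already cleared" messages,
collapses them into contiguous ranges, and maps each block to an ext3
block group. The function names are my own, and the value of 32768
blocks per group (with first data block 0) is an assumption based on the
usual layout for a 4 KiB block size; the real values should be checked
with "dumpe2fs -h" on the affected volume.

```python
import re

# Sample lines copied from the log above. The regex tolerates the
# message being wrapped onto a second line, as some mail clients do.
LOG = """\
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3):
ext3_free_blocks_sb: bit already cleared for block 786856796
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3):
ext3_free_blocks_sb: bit already cleared for block 786856797
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856798
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856799
Jul 17 09:01:06 kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856800
"""

BLOCK_RE = re.compile(
    r"ext3_free_blocks_sb:\s+bit already cleared for block (\d+)"
)

# ASSUMPTION: typical values for a 4 KiB block size; verify with dumpe2fs -h.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0


def bad_blocks(text):
    """Return the sorted, de-duplicated block numbers reported in the log."""
    return sorted({int(n) for n in BLOCK_RE.findall(text)})


def contiguous_ranges(blocks):
    """Collapse a sorted list of block numbers into (first, last) ranges."""
    ranges = []
    for b in blocks:
        if ranges and b == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], b)
        else:
            ranges.append((b, b))
    return ranges


def block_group(block, blocks_per_group=BLOCKS_PER_GROUP,
                first_data_block=FIRST_DATA_BLOCK):
    """Map a file system block number to its block group index."""
    return (block - first_data_block) // blocks_per_group


blocks = bad_blocks(LOG)
for first, last in contiguous_ranges(blocks):
    print(f"blocks {first}-{last}: groups "
          f"{block_group(first)}-{block_group(last)}")
```

If the same block ranges or groups show up after each fsck, that would
point at one region of the volume; the block numbers could then be fed
to the debugfs "icheck" command (block to inode) and "ncheck" (inode to
path) to see which files are involved.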
If I run fsck, it does seem to repair bad blocks and clear inodes, but
of course on 8TB it takes a long time to run, and the corruption
comes back later.
I have considered upgrading the kernel, and that could be done. I
suspect part of the problem is the large number of symbolic links on
that partition, but without evidence it will be difficult to get people
to change it. I also don't like the first line in the messages about
device sdd getting a "No Sense" response to a SCSI sense key request.
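As a starting point for gathering evidence on the symbolic links, here
is a small Python sketch that counts the links in one directory and
splits them by target length. It relies on the fact that ext3 stores
link targets shorter than 60 bytes inside the inode itself ("fast"
symlinks), so only longer targets occupy a data block. The function name
and the example path are hypothetical, and this only describes the
links; it does not by itself show that they cause the corruption.

```python
import os

# ext2/ext3 keep targets shorter than this many bytes in the inode.
FAST_SYMLINK_MAX = 60


def summarize_symlinks(directory):
    """Count symlinks directly inside `directory`, split by storage type."""
    total = 0
    fast = 0
    for entry in os.scandir(directory):
        if entry.is_symlink():
            total += 1
            target = os.readlink(entry.path)
            # Compare the on-disk byte length, not the character count.
            if len(os.fsencode(target)) < FAST_SYMLINK_MAX:
                fast += 1
    return {"symlinks": total, "fast": fast, "slow": total - fast}


if __name__ == "__main__":
    # Hypothetical path; substitute the real subdirectory.
    print(summarize_symlinks("/path/to/link/directory"))
```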
Any good advice on how to proceed would be appreciated. I have
looked at the dumpe2fs and debugfs tools, but I don't see how to put
them to good use in this case.
Thomas Walker
_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users