On Mon, 11 Apr 2011 09:10:08 -0400 Ted Ts'o <tytso@xxxxxxx> wrote:

> Your symptoms don't sound familiar to me, other than the standard
> concerns about hardware induced file system inconsistency problems.

The thing is, I do not observe any in-file random data corruption that would
point to a problem at a lower (block-device) level, so I do not think it is a
RAID or HDD problem. The breakage seemed to be at the filesystem logic level,
perhaps something to do with the allocation of space for new files. And since,
immediately before that, I had performed two operations possibly affecting it
(changing the stride size with tune2fs, plus an online grow with resize2fs),
I thought this might be an ext4 problem.

While still in the same session, I then re-copied the affected files,
replacing their "shortened" copies, and they were written out fine the second
time. After a reboot, no more file truncations have been observed so far.

> Have you checked your logs carefully to make sure there weren't any
> hardware errors reported?

No, there were no errors in dmesg, nor on the console where 'cp' would have
printed its errors.

> If this is a hardware RAID system, is it regularly doing disk scrubbing?
> Has the hardware RAID reported anything unusual? How long have you been
> running in a degraded RAID 6 state?

It is an mdadm RAID6, and it does not report any problems. It was running in
a degraded state for only a short time (less than a day). And AFAIK running
degraded with one disk missing is not a dangerous or risky situation with
RAID6.

> And have you tried shutting down the system and running fsck to make
> sure there weren't any file system corruption problems? When's the
> last time you've run fsck on the system?

I have unmounted it and run fsck just now. Admittedly, a long time had passed
since the last fsck.

# e2fsck /dev/md0
e2fsck 1.41.12 (17-May-2010)
/dev/md0 has gone 306 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md0: 367107/364412928 files (4.3% non-contiguous), 1219229259/1457626752 blocks

> If this is an LVM system, I'd strongly suggest that you set aside
> space you can take a snapshot, and then regularly take a snapshot, and
> then run fsck on the snapshot. If any problems are noted, you can
> then schedule downtime and fsck the entire system.

No, I don't use LVM there.

--
With respect,
Roman