On Jan 4, 2008 10:20 PM, Dennison Williams <evoltech@xxxxxxxxxxx> wrote:
> > Did you try and re-insert the kicked-out drive as if it was clean, or did
> > you try to re-sync it to the existing filesystem? If the former, then
> > that's a HUGE mistake because the data on the drive is no longer in sync
> > with what is on the other drives. (unless the entire filesystem was made
> > read-only when (or before) the drive was dropped out.)
>
> I re-inserted it with:
>
>     mdadm /dev/md0 --add /dev/sde
>
> At which point it seemed to resync with the raid device (i.e. the output
> of /proc/mdstat showed that it was incrementally syncing).
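As an aside, you can watch a resync advance while it runs; a quick sketch,
assuming the array is /dev/md0 as above:

    # rebuild speed, ETA and percent complete, refreshed every 5 seconds
    watch -n 5 cat /proc/mdstat

    # or ask mdadm for the array state directly
    mdadm --detail /dev/md0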
> > Check the SMART logs for each of the drives to see if they've had any
> > problems.
>
> There are messages like this:
>
>     /dev/sdc, failed to read SMART Attribute Data
>
> ...but this wasn't one of the disks that was removed from the raid device.
If there are complaints about sdc, then I'd be inclined to run a long SMART
self-test on it; it's possible that the real problem started here.
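Something like this, assuming smartmontools is installed and sdc is the
suspect drive:

    # start the long (extended) self-test; it runs on the drive itself,
    # and smartctl prints an estimated completion time
    smartctl -t long /dev/sdc

    # once it's done, read back the results and the overall health verdict
    smartctl -l selftest /dev/sdc
    smartctl -H /dev/sdc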
A badblocks read test (or just a dd if=/dev/sdc of=/dev/null) would also
exercise the I/O path between the drive and the CPU. If there are complaints
about that drive at this point, you should consider it suspect.
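For example (both are non-destructive, read-only passes; sdc assumed as
above):

    # read-only surface scan, with progress
    badblocks -sv /dev/sdc

    # or a plain sequential read; the big block size just speeds it up
    dd if=/dev/sdc of=/dev/null bs=1M

Either should run clean end-to-end on a healthy drive; check dmesg afterwards
for I/O errors.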
> > Try pulling the (candidate) compromised drive out of the array and see if
> > the (degraded) filesystem works OK and has good data. If it does, then I'd
> > guess that the pulled drive had bad data written to it somehow --- re-add it
> > (as if it was hot-swapped in), and hope it doesn't happen again.
> > Try that with each of the drives, in turn, until you find the badly written
> > drive. If one of the drives has badly written data, the system really can't
> > tell, for sure, which one is wrong.
> I want to make sure I understand you here. Say my raid device is
> comprised of four devices /dev/md0 = /dev/sd[abcd], are you suggesting
> that for each drive I do something like this:
>
>     mdadm /dev/md0 --fail /dev/sda --remove /dev/sda
>
> then try to mount up the FS as usual to see if it is there? Wouldn't
> this point be moot if the device already re-assembled itself?

Yes, it would be moot. Don't bother: if the drive got resynced, then pulling
it won't do any good, unless software RAID gets silently confused by random
data on one plex.
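If you want to confirm whether the kicked member really was resynced, compare
the event counts in the md superblocks; a sketch, assuming the members are
sd[abcd] as in your example and the superblocks are still readable:

    # members that are in sync with the array show (nearly) the same count
    mdadm --examine /dev/sd[abcd] | egrep '/dev/sd|Events'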
> > [[ unless the array was read-only when the drive was dropped, then you
> > will only have any hope of good data with the dropped drive pulled ]]
>
> It wasn't read-only, but nothing was writing to it.
>
> Thanks for your time and prompt response.
> Sincerely,
> Dennison Williams

Unless noatime was set, the drive was being written to (if only with atime
data). If all that got scrambled was atime data, you should still have been
able to mount the drive.
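For next time: a mount like the following is guaranteed not to write anything
at all, atimes included (mount point assumed):

    # read-only: no data, metadata or atime updates reach the array
    mount -o ro,noatime /dev/md0 /mnt/recovery

    # ext3 may still replay its journal even on a ro mount;
    # the noload option skips that too, for a strictly hands-off look
    mount -t ext3 -o ro,noload /dev/md0 /mnt/recovery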
--
Stephen Samuel http://www.bcgreen.com
778-861-7641
_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users