Wed is the best day this week for me. Let's see if we can arrange for that.

Matthew Galgoci wrote:
Some time on Friday, cvs-int.fedora.phx.redhat.com sustained undetermined storage problems and resulting filesystem corruption. As best I can figure, we had one drive in a raid6 array drop offline, and another disk in that array emit scsi errors.

Now, you're probably thinking, this is raid6, it should have been able to sustain losing two disks and keep on going. Well, you're right and you're wrong. If two disks had simply dropped out of the array, we'd be fine. That wasn't the case, however. Somewhere in the equation is data corruption. RAID is great up until your hardware corrupts the data. To support this claim, all you need to do is realize that we sustained numerous ext3 errors, the journal aborted, and the root fs went read-only.

I did my level best to revive the system on Friday and Saturday. I was able to get it pxe booted onto rescue media, which helped recovery immensely. I took numerous screenshots to chronicle what I went through as I attempted to recover the raid6 arrays and the logical volumes.
http://people.redhat.com/~mgalgoci/cvs-int.jpg
http://people.redhat.com/~mgalgoci/cvs-int2.jpg
http://people.redhat.com/~mgalgoci/cvs-int3.jpg
http://people.redhat.com/~mgalgoci/cvs-int4.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs5.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs6.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs8.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs9.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs10.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs11.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs12.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs13.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs14.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs15.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs18.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs17.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs16.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs19.jpg
http://people.redhat.com/~mgalgoci/fedora-cvs20.jpg

After #20, I said the hell with it, time to move on.

We've installed one of the new Dell 2950 machines that Dell was kind enough to donate to the Fedora Project. Mike McGrath is in the process of updatifying and restorifying the data from backups.

I have a Dell tech coming on site again today to do some more work on the old new cvs-int server. I think we know what the issues are on it and we'll have it usable again in the next day or so.

In the meantime, I think we need to take a look at all the Dell Fedora boxes and check the scsi drives in them. There are known issues with certain drive firmware that cause drives to go offline and report spurious errors. The relevant Dell update is here:

http://support.us.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R123859&formatcnt=1&libid=0&fileid=164751

We'll need downtime and hands on site to do this update. I'm sure Stacy will be able to assist.
--
========================================================
= Stacy J. Brandenburg                    Red Hat Inc. =
= Manager, Network Operations      sbranden@xxxxxxxxxx =
= 919-754-4313                   http://www.redhat.com =
========================================================