Re: filesystem corruption?

"Valeri Galtsev" <galtsev@xxxxxxxxxxxxxxxxx> · Mon, 6 Apr 2015 21:21:12 -0500 (CDT)

On Mon, April 6, 2015 4:37 pm, m.roth@xxxxxxxxx wrote:
> Got an older server here, running CentOS 6.6 (64-bit). Suddenly, at
> 0-dark-30 yesterday morning, we had failures to connect.
>
> After several tries to reboot and get working, I tried yum update, and
> that failed, complaining of a python krb5 error. With more investigation,
> I discovered that logins were failing as there was a problem with pam;
> this turned out to be it couldn't open /lib64/security/pam_permit.so. The
> reason for that was that it was a broken symlink, pointing to a file in
> the same directory, that actually existed in the /lib64. Checking other
> systems, I found it should, in fact, be a file, not a symlink.
>
> At this point, the system was considered suspect. I brought the system
> down, replaced the root drive, and rebuilt. I was not able to build it as
> CentOS 7, as something in the older hardware broke the install. CentOS 6
> built successfully, and the server was returned to service.
>
> I then loaded the drive in another server, and examined it. fsck reported
> both / and /boot were clean, but when I redid this with fsck -c, to check
> for bad blocks, it found many multiply-claimed blocks.
>
> First question: anyone have an idea why it showed as clean, until I
> checked for bad blocks? Would that just be because I'd gracefully shut
> down the original server, and it mounted ok on the other server?
>
> Mounting it on /mnt, I found no driver errors being reported in the logs,
> nor anything happening, including logons, before an automated contact from
> another server, which failed. AND I checked our loghost, and nothing odd
> shows there, neither in message nor in secure.
>
> At this point, I *think* it's filesystem corruption, rather than a
> compromised system, but I'd really like to hear anyone's thoughts on this.
>
>       mark
>

Someone has suggested reformatting the disk. Before doing that, you may
want to make an image of the whole drive as it is now: dd the whole device
into a file (somewhere on a large enough filesystem). I would definitely
have done that before even running fsck or badblocks (BTW, badblocks has a
non-destructive mode), though it is too late to mention that now. You may
need this image for future forensics.
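
If dd is not at hand, the same imaging step can be sketched in Python. This
is only a minimal illustration, not a forensic tool: the function name, the
chunk size and the example paths are my own assumptions, and unlike dd with
conv=noerror it simply stops at the first read error. It also computes a
SHA-256 of what was read, so later copies of the image can be checked
against the original.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice


def image_device(source_path, image_path, chunk_size=CHUNK_SIZE):
    """Copy source_path to image_path byte for byte.

    Returns (bytes_copied, sha256_hex) for the data that was read.
    """
    digest = hashlib.sha256()
    copied = 0
    # The source is opened read-only; mode "xb" refuses to overwrite
    # an image file that already exists.
    with open(source_path, "rb") as src, open(image_path, "xb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()


# Hypothetical usage (both paths are placeholders, run as root):
#   size, checksum = image_device("/dev/sdX", "/bigfs/old-root.img")
```

Keep the checksum with the image; every working copy should hash to the
same value.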

The best would be to have had some system integrity suite installed before
the bad event; then you would be able to tell exactly what changed (and
approximately when). Alas, you don't seem to have that option. You should
be able to use your backup as a sort of replacement for that (hopefully you
back up the system area as well). I would restore everything from the
closest date before the event and compare it with what you see on the
mounted image(s) of your drive (I would definitely work with a copy of a
copy of the image, leaving the original intact). I would definitely mount
them read-only and without replaying the journal (for ext3/ext4 that is the
ro,noload mount options). Look through the logs to see what kind of events
you find there. Check that the logs were not tampered with (chkrootkit may
be your friend). Look at who logged in when and for how long (and from
where!), and see whether there is any correlation with segfaults or kernel
oopses, or whether some kernel modules were loaded (should they have been
loaded all of a sudden?). Anyway, get a forensics guide if you don't do
forensics often, and follow it. It may take a couple of weeks, depending on
how busy you are in general. Good luck with that.
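
The comparison of a restored backup against the mounted image can be
sketched as below. This is a rough illustration under my own assumptions
(the function names and the two directory arguments are hypothetical): it
walks both trees without following symlinks, records for each path whether
it is a file, symlink or directory, hashes regular files, and reports every
path that differs. A file that turned into a symlink, like pam_permit.so
in your case, shows up as a changed entry.

```python
import hashlib
import os
import stat


def describe(path):
    """Return a (kind, detail) tuple for path without following symlinks."""
    st = os.lstat(path)
    if stat.S_ISLNK(st.st_mode):
        return ("symlink", os.readlink(path))
    if stat.S_ISREG(st.st_mode):
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return ("file", digest.hexdigest())
    if stat.S_ISDIR(st.st_mode):
        return ("dir", "")
    return ("other", "")


def snapshot(root):
    """Map each path relative to root to its describe() tuple."""
    result = {}
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            result[os.path.relpath(full, root)] = describe(full)
    return result


def compare_trees(reference_root, suspect_root):
    """Return sorted (relative_path, reference_entry, suspect_entry)
    tuples for every path that differs; a missing entry is None."""
    ref = snapshot(reference_root)
    sus = snapshot(suspect_root)
    return [(p, ref.get(p), sus.get(p))
            for p in sorted(set(ref) | set(sus))
            if ref.get(p) != sus.get(p)]
```

Expect a lot of harmless differences (logs, caches, anything that changed
after the backup date); the interesting ones are in system areas such as
/lib64, /bin, /sbin and /etc.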

The hardware (drive) hypothesis is very attractive. I would kick myself so
that wishful thinking does not take over. But if bad blocks were indeed
detected, this is quite likely your case. Again, the logs should have
records, as the drive will report its hardware events. I would also check
the SMART status of the drive. Try to get some information from the drive
(hdparm comes to mind; careful, you don't want to change anything, which is
what hdparm is mostly used for, just collect info). After everything else
has been tried, I would run the hard drive fitness test (vendors have
downloadable utilities). BTW, what is the model/manufacturer of the drive?
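
To collect that information without changing anything on the drive,
something like the following could be used. It is a sketch that assumes
smartctl (from smartmontools) and hdparm are installed and that it is run
as root; the function names and the device path are hypothetical. Only
informational options are used: smartctl -H and -a, and hdparm -I.

```python
import subprocess


def info_commands(device):
    """Return argv lists for commands that only read info from device."""
    return [
        ["smartctl", "-H", device],  # overall SMART health self-assessment
        ["smartctl", "-a", device],  # all SMART info, incl. error log
        ["hdparm", "-I", device],    # identification info from the drive
    ]


def collect_info(device):
    """Run each command; return a list of (argv, returncode, output)."""
    results = []
    for argv in info_commands(device):
        try:
            proc = subprocess.run(argv, stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT,
                                  universal_newlines=True)
            results.append((argv, proc.returncode, proc.stdout))
        except FileNotFoundError:
            # The tool is not installed on this machine.
            results.append((argv, None, "command not found"))
    return results


# Hypothetical usage (replace /dev/sdX with the suspect drive):
#   for argv, code, output in collect_info("/dev/sdX"):
#       print(" ".join(argv), "->", code)
#       print(output)
```

Save the output next to the drive image, so you have the drive's own view
of its health from before any vendor test was run.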

[There is one more possibility, which is unlikely to be your case: bad
memory, or just a small memory error but in a really bad place that caused
big consequences. A reboot would normally resolve that kind of trouble, and
rebooting did not help you, so it is unlikely to be your case. But if this
hits a specific place in RAM, it can cause corruption of the filesystem as
well...]

Good luck! Let us know what you find out.

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos