On Tue, 5 Jun 2007, Paolo Bizzarri wrote:

> On 6/4/07, Scott Marlowe <smarlowe@xxxxxxxxxxxxxxxxx> wrote:
>> http://lwn.net/Articles/215868/
>> documents a bug in the 2.6 linux kernel that can result in corrupted
>> files if there are a lot of processes accessing it at once.
>
> in fact, we were using a 2.6.12 kernel. Can this be a problem?

That particular problem appears to be specific to newer kernels so I
wouldn't think it's related to your issue.
Tracking down random crashes of the sort you're reporting is hard. As
Scott rightly suggested, the source of the problem could easily be any
number of hardware components or low-level software like the kernel. The
tests required to really certify that a server is suitable for production
use can take several days to run. The normal approach here
would be to move this application+data to another system and see if the
problem is still there; that lets you rule out all the hardware at once.
That would also accomplish something else you should be thinking about:
making absolutely sure you can back up and restore your data, and that the
corruption you're seeing isn't causing information to be lost in your
database.
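
One way to exercise the backup-and-restore check is a dump/restore round
trip into a scratch database. The sketch below assumes the database is
PostgreSQL, which the message above does not actually say, and the
database names and dump path are hypothetical placeholders. A dump that
aborts partway through is itself a useful signal that some table cannot
be read cleanly.

```python
import subprocess


def backup_restore_commands(source_db, scratch_db, dump_path):
    """Build the commands for a dump/restore round trip.

    Assumes PostgreSQL client tools (pg_dump, createdb, pg_restore)
    are on PATH; all three names passed in are placeholders.
    """
    return [
        # Custom-format dump of the live database to a file.
        ["pg_dump", "-Fc", "-f", dump_path, source_db],
        # Empty scratch database to restore into.
        ["createdb", scratch_db],
        # Restore the dump into the scratch database.
        ["pg_restore", "-d", scratch_db, dump_path],
    ]


def run_round_trip(source_db, scratch_db, dump_path):
    """Run each step in order; stop with an exception at the first failure."""
    for cmd in backup_restore_commands(source_db, scratch_db, dump_path):
        subprocess.run(cmd, check=True)
```

Calling run_round_trip("appdb", "appdb_restore_test", "/tmp/appdb.dump")
would run the three steps in order. Ideally the restore happens on a
different machine, so the suspect hardware is not also the thing
validating the backup.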
The general flow of figuring out the cause for random problems goes
something like this:
1) Check for memory errors. http://www.memtest86.com/ is a good tool for
PCs. That will need to run for many hours.
2) Run the manufacturer's disk utilities to see if any of your disks are
going bad. You might be able to do this using Linux's SMART tools instead
without even taking the server down; if you're not using those already you
should look into that. http://www.linuxjournal.com/article/6983 is a good
intro here.
3) Boot another version of Linux and run some low-level disk tests there.
A live CD/DVD like Knoppix or Ubuntu is the easiest way to do that.
4) If everything above passes, upgrade to the kernel version used on the
live CD/DVD and see if the problem goes away.
You can try skipping right to #4 here and playing with the kernel first,
but understand that if your underlying hardware has issues, that may cause
more corruption (with possible data loss) rather than less.
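
As a rough illustration of the SMART check in step 2, here is a minimal
sketch that asks smartctl (from the smartmontools package) for a drive's
overall health verdict and interprets the answer. It assumes smartctl is
installed and run with enough privileges to query the drive; the device
path used in the note after the code is only an example. A passing
verdict does not prove the disk is good, so treat this as a quick first
look, not a replacement for the longer tests.

```python
import subprocess


def parse_smart_health(output):
    """Return True (healthy), False (failing), or None (no verdict found)."""
    for line in output.splitlines():
        # ATA drives print "... self-assessment test result: PASSED";
        # SCSI/SAS drives print "SMART Health Status: OK".
        if ("self-assessment test result:" in line
                or "SMART Health Status:" in line):
            verdict = line.split(":", 1)[1].strip()
            return verdict in ("PASSED", "OK")
    return None


def check_disk(device):
    """Run 'smartctl -H' on one device and parse its health verdict."""
    # -H requests only the overall health assessment.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return parse_smart_health(result.stdout)
```

For example, check_disk("/dev/sda") returns True when the drive reports
itself healthy, False when it reports a failure, and None when no verdict
could be read (for instance, if the drive does not support SMART).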
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD