Re: MTBF of Ext3 and Partition Size

Ric Wheeler <rwheeler@xxxxxxxxxx> · Thu, 16 Apr 2009 12:55:37 -0400

Theodore Tso wrote:
On Thu, Apr 16, 2009 at 07:53:59AM -0400, Kyle Brandt wrote:
On several of my servers I seem to have a high rate of server crashes due to
file system errors.  So I have some questions related to this:

Is there any Mean Time Between Failure (MTBF) data for the ext3
file-system?

Does increased partition size cause a higher risk of the partition being
corrupted? If so, is there any data on the ratio between partition size and
the likelihood of failure?

The probability of these sorts of filesystem problems is going to be
dominated by hardware induced corruptions --- so it's not going to
make a lot of sense to talk about MTBF without having a
specific hardware context in mind.  If you have lousy memory, or a
lousy disk controller cable, or a cable connector which is loose then
corruptions will happen often.  If you are located some place
where there is a strong alpha particle source, then you will have a
much greater percentage chance of bit flips.  If you use ECC memory,
and do very careful hardware selection, with enterprise-quality disks
that trade off disk capacity for a much stronger level of ECC codes,
then of course the MTBF will be much higher.

(For example, there was the infamous story in the early 1990s when
Sun had a spate of bad memory; I think it was ultimately traced to
radioactive contamination of the ceramic materials used to make their
memory chips; the resulting alpha particles caused "bit flips", which
made their customers rather antsy, especially since Sun tried to deny
there was even a problem for quite some time.)

So if you are having a high rate of server crashes, the first thing I
would do is to make sure you have the latest distribution updates;
it's possible it's caused by a kernel bug that has since been fixed,
but it's somewhat unlikely.  The next thing I would do is take one of
the machines that has been crashing offline, and try running a 36-48
hour memory test.

Does ext3 on hardware raid (10) increase the possibility of file system
corruption?

No, it shouldn't --- unless you have a buggy or otherwise dodgy
hardware raid controller.

						- Ted

One note is that the file system will often be the first notification that your 
hardware RAID has done something wrong - you should have a careful look at any 
logs/errors/etc that your storage maintains for you.
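
As a rough illustration of that kind of log review, here is a minimal Python
sketch that counts suspicious lines in a saved kernel log. The function name
classify_log_lines and the three patterns are my own assumptions for
illustration; the exact message wording differs between kernels, drivers and
RAID controllers, so treat the patterns as a starting point rather than a
complete list.

```python
import re
from collections import Counter

# Illustrative patterns only (assumed, not exhaustive): real log wording
# varies by kernel version, driver, and storage controller.
PATTERNS = {
    "ext3": re.compile(r"EXT3-fs error", re.IGNORECASE),
    "io": re.compile(r"I/O error", re.IGNORECASE),
    "mce": re.compile(r"machine check", re.IGNORECASE),
}


def classify_log_lines(lines):
    """Count how many log lines match each named pattern.

    A single line may match more than one pattern and is then counted
    once under each matching name.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, read the lines of a saved log (for example the output of dmesg
written to a file) and pass them to classify_log_lines. If file system errors
show up alongside I/O errors or machine checks, that points toward the
hardware rather than ext3 itself.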

Can you share specifics of your system - what is the storage, which kernel, etc?

Regards,

Ric

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users