[Centos] Promise raid cards - software raid

dan1@xxxxxxxxxxxx (dan1) · Fri, 5 Nov 2004 07:28:30 +0100

Thanks again for those answers which I'm sure will also interest other
people.

If I understand correctly, software RAID is no less reliable than hardware
RAID when the power fails (as long as we use a journaled file system). Is
that correct?
I'm not worried about the extra CPU load of software RAID; that isn't a
problem for me.
What else do you think I would lose by using software RAID instead of
hardware RAID?

Also, sorry to keep coming back to power failures, but here is the reason:
my provider gives me a remote reboot on my server, which is very helpful
when the system suddenly hangs, as I have no physical access. Currently I
have to do a hardware reboot 1-2 times per year, and the system starts and
runs correctly again. These remote reboots amount to power failures, and
they are offered to customers more and more frequently. They save us from
asking a technician to reboot the machine.

Your professional experience is very instructive. Everything you said about
the flush problem is fascinating. Do you mean that a battery-backed RAID
card will not help much with reliability when the power fails? Have many
people been misled, then?

Last: is there a way of knowing what data was lost when a power failure
happened? Can we see it in terms of sectors or clusters, or in terms of
corrupted files? Can we get a list of those?

Thank you very much, Terrence!

Daniel

  ----- Original Message ----- 
  From: Terrence Martin
  To: dan1
  Sent: Friday, November 05, 2004 2:01 AM
  Subject: Re: [Centos] Promise raid cards - software raid

  dan1 wrote:

  > Hello, Terrence.
  >
  > Thank you for your complete answers. That's very interesting.
  >
  >
  > > I am not sure what you mean about the file system crashing?
  >
  > I meant that it becomes unrecoverable, or that some data is missing.
  >

  These are two separate problems with two separate solutions.

  You can lose data without the file system having any problems: the
  data is missing, yet the file system works perfectly. In fact, journaled
  file systems choose preserving the file system over preserving the data
  every time.

  For example:

  You have a database and are running a query that inserts 1000 records.
  If you do not insert all 1000 records then you cannot use any of them,
  because record 1000 depends on all of the previous 999. On record 678
  the power fails. When the system comes back, what happened to the first
  678 inserts? You can assume they completed, but since you need all 1000
  for the data to be valid you basically have to delete the ones you did
  insert and start again, if you can.

  Deleting means that you first have to know which records to delete.

  Depending on the complexity of the inserts (they could touch dozens of
  tables and interrelationships), you could have a lot of work ahead of
  you to work out by hand which records belong to that incomplete set and
  which do not.

  This is where a journal comes in. A journal records changes as they are
  written, and notes when each change has finished, so that in the worst
  case you can replay the journal to back out of whatever was left
  incomplete. This is what a journaled file system does: it lets you back
  out of the incomplete data problem quickly. Contrast this with the old
  method of file system consistency checking, which was like a manual
  search of the entire database. Going over only the changes is faster
  than searching the whole database, and the long manual check is also
  complicated and prone to error. For speed and accuracy a journal is
  better.

  However, journaled file systems do not save the data. In fact, a
  journaled file system will throw data away if it is incomplete. Suppose
  that in the example above you could actually use the first 678 records,
  or that there was no way to regenerate the 1000 records, so 678 was
  better than nothing. The journal does not care. It simply checks whether
  the whole transaction of 1000 records completed. If it did not, it
  discards everything written up to the failure. Even if 999 of the 1000
  records were written, it will still discard the 999. The assumption is
  that it is better to have a consistent file system and protect the good
  data than to keep partially written data that, while valuable, is
  inconsistent.
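
The all-or-nothing behaviour described above can be sketched in a few
lines of Python. This is only a toy model under stated assumptions: the
journal is a plain list, and the entry names "begin", "write" and "commit"
and the helpers build_journal() and recover() are made up for
illustration. It is not how any real file system lays out its journal.

```python
# Toy all-or-nothing journal. Entry names and helpers are hypothetical.

def build_journal(total, fail_after=None):
    """Journal one transaction of `total` records.

    If `fail_after` is set, the power "fails" after that many records,
    so the commit marker is never written.
    """
    journal = [("begin", 1)]
    for record in range(1, total + 1):
        journal.append(("write", 1, record))
        if fail_after is not None and record == fail_after:
            return journal  # power failure: no commit marker
    journal.append(("commit", 1))
    return journal


def recover(journal):
    """Keep only records from transactions that reached their commit marker."""
    committed = []
    pending = {}
    for entry in journal:
        kind, txid = entry[0], entry[1]
        if kind == "begin":
            pending[txid] = []
        elif kind == "write":
            pending[txid].append(entry[2])
        elif kind == "commit":
            committed.extend(pending.pop(txid))
    # Whatever is still in `pending` never committed and is discarded.
    return committed


print(len(recover(build_journal(1000))))       # 1000
print(len(recover(build_journal(1000, 678))))  # 0
print(len(recover(build_journal(1000, 999))))  # 0
```

Whether the failure comes at record 678 or record 999, recovery keeps
nothing from the unfinished transaction.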

  If you want to ensure that even partial data is preserved, you have to
  do other things to protect it. A battery-protected RAID card is one very
  narrow approach that solves one specific failure state: data that has
  been written to the RAID card but not yet to disk.

  Let's go back to the example above.

  You have a power failure on record 678. The RAID card has memory to
  store 5 records. At the time of the power failure it has only sent the
  first 673 records to the disks for writing. The other 5 are in the
  controller cache. If you have a battery on that memory you will save
  those 5 records. But does it matter? After all, 678 or 673, they are
  still not 1000. The disks themselves may also hold 2 records in their
  own cache. So the disk has only written records up to 671, with records
  672 and 673 still waiting in volatile RAM on the disk itself, which has
  no battery backup (write-back cache, as on all PATA drives).

  The power fails and the system comes back online. The RAID card sends
  records 674-678 to the disks and they write them. Unfortunately records
  672 and 673 are lost because they were in the volatile cache on the
  disk itself.

  So you have records 1-671 and 674-678. You in fact have a hole in what
  you have on disk, and who cares anyway, because the journaled file
  system is going to delete all those records when it goes to work,
  putting integrity ahead of data preservation.
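
The cache arithmetic in this example can be checked with a short Python
sketch. It assumes records are numbered from 1, a controller cache of 5
records and a disk cache of 2 records; the function simulate() and its
parameter names are hypothetical, and a real controller and drive behave
in far more complicated ways.

```python
# Toy model of the two cache layers. All names here are hypothetical.

def simulate(fail_at=678, controller_cache=5, disk_cache=2, battery=True):
    """Return (survivors, lost) record numbers after a power failure."""
    sent_to_disk = fail_at - controller_cache  # records that left the controller
    on_platter = sent_to_disk - disk_cache     # records actually written
    in_controller = range(sent_to_disk + 1, fail_at + 1)

    survivors = set(range(1, on_platter + 1))
    if battery:
        # The battery keeps the controller cache alive, so it is replayed.
        survivors |= set(in_controller)
    # Records in the volatile disk cache are gone either way.
    lost = sorted(set(range(1, fail_at + 1)) - survivors)
    return sorted(survivors), lost


survivors, lost = simulate(battery=True)
print(lost)            # [672, 673]
print(len(survivors))  # 676

survivors, lost = simulate(battery=False)
print(lost)            # [672, 673, 674, 675, 676, 677, 678]
```

With or without the battery, the disk cache leaves a hole, and in both
cases far fewer than 1000 records made it to disk.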

  Basically, RAID batteries buy you something, but not much, and they buy
  you even less when they are attached to ATA drives whose write-back
  cache essentially makes the RAID cache moot.

  > > I do not recommend ext3 for anything over about 120GB.
  >
  > OK, that's interesting.
  >

  I work with compute clusters and with file systems that are terabytes
  in size. Basically nothing else has come close to XFS in practice. The
  people who administer the really big systems that we collaborate with
  will not touch anything but XFS, and they have petabytes of storage. If
  you can, go with XFS. Even RHEL4 should finally have XFS as standard,
  since Fedora Core 2 and later has it as an option, even for the root
  disk.

  > > My biggest question is why at this point are you even bothering with
  > > PATA drives? Compared to SATA drives they are unreliable and poor
  > > performing for about the same cost.
  >
  > This is what I get from my ISP (I think). However, it doesn't change
  > my thinking about RAID much. The flush problem remains the same.
  > Also, I am more familiar with PATA.

  The flush problem, as I hope I have demonstrated, is not addressed at
  all by battery backup on RAID card RAM. Battery backup of RAID memory is
  a good gimmick, but in practice it is useless. It covers such a narrow
  part of the problem space as to be irrelevant.

  If you are concerned that a power failure will lose data, get a UPS for
  the entire system. It is the only thing that will help you, because it
  is the only thing that allows your entire system, from software to
  hardware, to reach a consistent state before shutting down. Otherwise
  you may save a few bytes of data that were in cache on the RAID card,
  but that does not matter, since you will still end up with an incomplete
  file system transaction that the journaled file system is going to
  delete anyway.

  The Linux journaled file systems are very good at preserving integrity,
  in some cases even in the face of underlying hardware failure. Choosing
  a good file system is all you need to do to cover that aspect. As for
  data loss, aside from backups after the fact, the only solution that
  will work in practice is an uninterruptible power supply that gives you
  enough time to shut down the entire system in a consistent way.

  Terrence

  >
  > Thank you for your interesting advice. I appreciate it!
  >
  > Best regards,
  >
  > Daniel
  >
  >
  >
