Having a filesystem go into read-only mode is a "down system". Not acceptable to me! Maybe OK for a home system, but I don't assume Linux is limited to home use. In my case, it is not even acceptable for my home system. Time is money!

About user intervention: if the system stops working until someone does something, that is a down system. That is what I meant by user intervention. Replacing a disk on Monday that failed Friday night is what I would expect. That is a normal failure to me. Even if a re-boot is required, as long as it can be scheduled, it is acceptable to me.

You and I have had very different failures over the years! In my case, most failures are disks, and most of the time the system continues to work just fine, without user intervention. If spare disks are configured, the array re-builds to the spare. At my convenience, I replace the disk, without a system re-boot.

Most Unix systems I have used have SCSI disks. IDE tends to be in home systems. My home system is Linux with 17 SCSI disks. I have replaced a disk without a re-boot, but the disk cabinet is not hot-swap, so I tend to shut down the system to replace a disk.

My 20 systems had anywhere from 4 to about 44 disks. You should expect 1 disk failure out of 25-100 disks per year. There are good years and bad! Our largest customer system has more than 300 disks. I don't know the failure rate, but most failures do not take the system down! Our customer systems tend to have hardware RAID systems: HP, EMC, DG (now EMC).

If you have a 10% disk failure rate per year, something else is wrong! You may have a bad building ground, or too much current flowing on the building ground line. All sorts of power problems are very common. Most if not all electricians only know the building code; they are not qualified to debug all power problems. I once talked to an expert in the field. He said thunder causes more power problems than lightning! Most buildings use conduit for ground, with no separate ground wire. The thunder shakes the conduit and loosens the connections. This causes a bad ground during the thunder, which could crash computer systems (including hardware RAID boxes). Never depend on conduit for ground; always have a separate ground wire. This is just one example of the many issues he described. I don't recall all the details, and I am not an expert on building power.

I know of one event that matches your most common failure: a PC with a $50 case and power supply, where the power supply failed in such a way that it put 120V on the 12V and/or 5V line. Everything in the case was lost! Well, the heat sink was OK. :) The system was not repaired; it went into the trash. But this was a home user's clone PC, not a server.

Guy

-----Original Message-----
From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Peter T. Breuer
Sent: Monday, January 03, 2005 6:32 AM
To: linux-raid@xxxxxxxxxxxxxxx
Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)

Guy <bugzilla@xxxxxxxxxxxxxxxx> wrote:
> "Also sprach Guy:"
> > "Well, you can make somewhere. You only require an 8MB (one cylinder)
> > partition."
> >
> > So, it is ok for your system to fail when this disk fails?
>
> You lose the journal, that's all. You can react with a simple
> tune2fs -O ^journal or whatever is appropriate. And a journal is ONLY
> there in order to protect you against crashes of the SYSTEM (not the
> disk), so what was the point of having the journal in the first place?
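(For reference, a rough sketch of the tune2fs commands involved - the actual feature name is "has_journal", the device name is only a placeholder, and the filesystem should be unmounted, or mounted read-only, before the journal is dropped:)

    # inspect the current features and error behaviour of the filesystem
    tune2fs -l /dev/sdXN | grep -iE 'features|errors'

    # drop the journal so the filesystem no longer depends on the failed
    # journal device, then force a check
    tune2fs -O ^has_journal /dev/sdXN
    e2fsck -f /dev/sdXN

    # the "go read-only on error" behaviour mentioned below is set with:
    tune2fs -e remount-ro /dev/sdXN    # alternatives: continue, panic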
> ** When you lose the journal, does the system continue without it?
> ** Or does it require user intervention?

I don't recall. It certainly at least puts itself into read-only mode (if that's the error mode specified via tune2fs). And the situation probably changes from version to version.

On a side note, I don't know why you think user intervention is not required when a raid system dies. As a matter of likelihoods, I have never seen a disk die while IN a working soft (or hard) raid system and the system continue working afterwards. Instead, the normal disaster sequence as I have experienced it is:

1) lightning strikes the rails, or the a/c goes out and the room full of
   servers overheats. All lights go off.

2) when the sysadmin arrives to sort out the smoking wrecks, he finds that
   1 in 3 random disks are fried - they're simply the points of failure
   that died first, and they took down the hardware with them.

3) the sysadmin buys or jury-rigs enough pieces of non-smoking hardware to
   piece together the raid arrays from the surviving disks, and hastily
   does a copy to somewhere very safe and distant, while an assistant
   holds off the howling hordes outside the door with a shotgun.

In this scenario, a disk simply acts as the weakest link in a fuse chain, and the whole chain goes down. But despite my dramatisation, it is likely that a hardware failure will take out or damage the rest of your hardware! IDE disks live on an electrical bus connected to other hardware. Try a short circuit and see what happens. You can't even yank them out while the bus is operating if you want to keep your insurance policy. For SCSI the situation is better wrt hot-swap, but still not perfect. And you have the electrical connections also. That makes it likely that a real nasty hardware failure will do nasty things (tm) to whatever is in the same electrical environment. It is possible, if not likely, that you will lose contact with SCSI disks further along the bus, if you don't actually blow the controller.

That said, there ARE situations which raid protects you from - simply a "gentle disconnect" (a totally failed disk that goes open circuit), or a "gradual failure" (a disk that runs out of spare sectors). In the latter case the raid will fail the disk completely at the first detected error, which may well be what you want (or may not be!).

However, I don't see how you can expect to replace a failed disk without taking down the system. For that reason you are expected to be running "spare disks" that you can virtually insert hot into the array (caveat: it is possible with SCSI, but you will need to rescan the bus, which will take it out of commission for some seconds, which may require you to take the bus offline first; and it MAY be possible with recent IDE buses that purport to support hotswap - I don't know).
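(Roughly, the moving parts of that look like this - device names and the SCSI host/channel/id/lun numbers are only placeholders, and the /proc/scsi interface varies with kernel version:)

    # keep a hot spare in the array, so a failed member rebuilds automatically
    mdadm /dev/md0 --add /dev/sdc1

    # after a failure, mark the dead member failed and pull it out of the array
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

    # tell the SCSI layer about the replacement disk without a reboot
    echo "scsi add-single-device 0 0 2 0" > /proc/scsi/scsi

    # partition it to match, then add it back as the new spare
    mdadm /dev/md0 --add /dev/sdb1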
So I think the relevant question is: "what is it that you are protecting yourself from by this strategy of yours?" When you have the scenario, you can evaluate risks.

> > I don't want system failures when a disk fails,
>
> Your scenario seems to be that you have the disks of your mirror on the
> same physical system. That's fundamentally dangerous. They're both
> subject to damage when the system blows up. [ ... ] I have an array
> node (where the journal is kept), and a local mirror component and a
> remote mirror component.
>
> That system is doubled, and each half of the double hosts the other's
> remote mirror component. Each half fails over to the other.
> ** So, you have 2 systems, 1 fails and the "system" switches to the other.
> ** I am not going for a 5 nines system.
> ** I just don't want any downtime if a disk fails.

Well,

  (1) how likely is it that a disk will fail without taking down the system?
  (2) how likely is it that a disk will fail?
  (3) how likely is it that a whole system will fail?

I would say that (2) is about 10% per year. I would say that (3) is about 1200% per year. It is therefore difficult to calculate (1), which is your protection scenario, since it doesn't show up very often in the stats!

> ** A disk failing is the most common failure a system can have (IMO).

Not in my experience. See above. I'd say each disk has about a 10% failure expectation per year, whereas I can guarantee that an unexpected system failure will occur about once a month, on every important system. If you think about it, that is quite likely, since a system is by definition a complicated thing. And then it is subject to all kinds of horrible outside influences, like people rewiring the server room in order to reroute cables under the floor instead of through the ceiling, and the maintenance people spraying the building with insecticide, everywhere, or just "turning off the electricity in order to test it" (that happens about four times a year here - hey, I remember when they tested the giant UPS by turning off the electricity! Wrong switch. Bummer).

Yes, you can try and keep these systems out of harm's way on a colocation site, or something, but by then you are at professional-level paranoia. For "home systems", whole-system failures are far more common than disk failures.

I am not saying that RAID is useless! Just the opposite. It is a useful and EASY way of allowing you to pick up the pieces when everything falls apart. In contrast, running a backup regime is DIFFICULT.

> ** In a computer room with about 20 Unix systems, in 1 year I have seen 10
> or so disk failures and no other failures.

Well, let's see. If each system has 2 disks, then that would be 25% per disk per year, which I would say indicates low-quality IDE disks, but it is about the level I would agree with as experiential.

> ** Are your 2 systems in the same state?

No, why should they be?

> ** They should be at least 50 miles apart (at a minimum).

They aren't - they are in two different rooms. Different systems copy them every day to somewhere else. I have no need for instant survivability across a nuclear attack.

> ** Otherwise if your data center blows up, your system is down!

True. So what? I don't care. The cost of such a thing is zero, because if my data center goes I get a lot of insurance money and can retire. The data is already backed up elsewhere if anyone cares.

> ** In my case, this is so rare, it is not an issue.

It's very common here and everywhere else I know! Think about it - if your disks are in the same box then it is statistically likely that when one disk fails it is BECAUSE of some local cause, and that therefore the other disk will also be affected by it. It's your "if your data center burns down" reasoning, applied to your box.

> ** Just use off-site tape backups.

No way! I hate tapes. I back up to other disks.

> ** My computer room is for development and testing, no customer access.

Unfortunately, the admins do most of the sabotage.

> ** If the data center is gone, the workers have nowhere to work anyway (in
> my case).

I agree. Therefore who cares? OTOH, if only the server room smokes out, they have plenty of places to work, but nothing to work on. Tut tut.

> ** Some of our customers do have failover systems 50+ miles apart.
Banks here don't (hey, I wrote the interbank communications encryption software here on the peninsula). They have tapes. As far as I know, the tapes are sent to vaults. It often happens that their systems go down. In fact, I have NEVER managed to connect via their internet page to their systems at a time when they were in working order, and I have been trying on and off for about three years. And I have often been in the bank manager's office (discussing mortgages, national debt, etc.) when the bank's internal systems have gone down, nationwide.

> > so mirror (or RAID5)
> > everything required to keep your system running.
> >
> > "And there is a risk of silent corruption on all raid systems - that is
> > well known."
> > I question this....
> > Why!
> ** You lost me here. I did not make the above statement.

Yes you did. You can see from the quoting that you did.

> But, in the case
> of RAID5, I believe it can occur.

So do I. I am asking why you "question this"?

> Your system crashes while a RAID5 stripe
> is being written, but the stripe is not completely written.

This is fairly meaningless. I don't know what precise meaning the word "stripe" has in raid5, but it's irrelevant. Simply, if you write redundant data, whatever way you write it - raid 1, 5, 6 or whatever - there is a possibility that you write only one of the copies before the system goes down. Then when the system comes up it has two different sources of data to choose to believe.

> During the
> re-sync, the parity will be adjusted,

See: "when the system comes up ...". There is no need to go into detail and I don't know why you do!

> but it may be more current than 1 or
> more of the other disks. But this would be similar to what would happen to
> a non-RAID disk (some data not written).

No, it would not be similar. You don't seem to understand the mechanism. The mechanism for corruption is that there are two different versions of the data available when the system comes back up, and you and the raid system don't know which is more correct. Or even what it means to be "correct". Maybe the earlier-written data is "correct"!

> ** Also with RAID1 or RAID5, if corruption does occur without a crash or
> re-boot, then a disk fails, the corrupt data will be copied to the
> replacement disk.

Exactly so. It's a generic problem with redundant data sources. You don't know which one to believe when they disagree!

> With RAID1 a 50% risk of copying the corruption, and 50%
> risk of correcting the corruption. With RAID5, risk % depends on the number
> of disks in the array.

It's the same. There are two sources of data that you can believe: the "real data on disk" or "all the other data blocks in the 'stripe' plus the parity block". You get to choose which you believe.

> > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you
>
> The corruption risk is doubled for a 2-way mirror, and there is a 50%
> chance of it not being detected at all even if you try and check for it,
> because you may be reading from the wrong mirror at the time you pass
> over the imperfection in the check.
> ** After a crash, md will re-sync the array.

It doesn't know which disk to believe is correct. There is a stamp on the disks' superblocks, but it is only updated every so often. If the whole system dies while both disks are OK, I don't know what will be stamped or what will happen (which will be believed) at resync. I suspect it is random. I would appreciate clarification from Neil.
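(If you want to look for disagreements between the two halves by hand, a crude sketch follows - the device names are placeholders, the array has to be idle, and a raw compare will also flag the md superblock area at the end of each component, which legitimately differs:)

    # stop the array so neither half changes underneath the comparison
    mdadm --stop /dev/md0

    # byte-compare the two raid1 components; differing offsets get listed
    cmp -l /dev/sda1 /dev/sdb1 | head

    # later md versions can do this online, if these sysfs knobs exist:
    #   echo check > /sys/block/md0/md/sync_action
    #   cat /sys/block/md0/md/mismatch_cnt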
> ** But during the re-sync, md could be checking for differences and
> reporting them.

It could. That might be helpful.

> ** It won't help correct anything, but it could explain why you may be
> having problems with your data.

Indeed, it sounds like a good idea. It could slow down RAID1 resync, but I don't think the impact on RAID5 would be noticeable.

> ** Since md re-syncs after a crash, I don't think the risk is double.

That is not germane. I already pointed out that you are 50% likely to copy the "wrong" data IF you copy (and WHEN you copy). Actually doing the copy merely brings that calculation into play at the moment of the resync, instead of later, at the moment when one of the two disks actually dies and you have to use the remaining one.

> Isn't that simply the most naive calculation? So why would you make
> your bet?
> ** I don't understand this.

Evidently! :)

> And then of course you don't generally check at all, ever.
> ** True, but I would like md to report when a mirror is wrong.
> ** Or a RAID5 parity is wrong.

Software raid does not spin off threads randomly checking data. If you don't use it, you don't get to check at all. So just leaving disks sitting there exposes them to corruption that is checked least of all.

> But whether you check or not, corruptions simply have only a 50% chance
> of being seen (you look on the wrong mirror when you look), and a 200%
> chance of occurring (twice as much real estate) wrt normal rate.
> ** Since md re-syncs after a crash, I don't think the risk is double.

It remains double whatever you think. The question is whether you detect it or not. You cannot detect it without checking.

> ** Also, I don't think most corruption would be detectable (ignoring a RAID
> problem).

You wouldn't know which disk was right. The disk might know, if it was a hardware problem.

Incidentally, I wish raid would NOT offline the disk when it detects a read error. It should fall back to the redundant data. I may submit a patch for that. In 2.6 the raid system may even do that - the resync thread comments SAY that it retries reads. I don't know if it actually does. Neil?

> ** It depends on the type of data.
> ** Example: corruption in your MP3 collection would go undetected until
> someone listened to the corrupt file. :-)
> In contrast, on a single disk they have a 100% chance of detection (if
> you look!) and a 100% chance of occurring, wrt normal rate.
> ** Are you talking about the disk drive detecting the error?

No. You are quite right. I should categorise the types of error more precisely. We want to distinguish

  1) hard errors (detectable by the disk firmware)
  2) soft errors (not detected by the above)

> ** If so, are you referring to a read error or what?

Read.

> ** Please explain the nature of the detectable error.

"Wrong data on the disk or as read from the disk". Define "wrong"!

> > know when a difference occurs, if you want.
>
> How?
> ** Compare the 2 halves of the RAID1, or check the parity of RAID5.

You wouldn't necessarily know which of the two data sources was "correct".

Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html