Re: reboot long uptimes?




Drew Weaver wrote:
 -----Original Message-----
From: centos-bounces@xxxxxxxxxx [mailto:centos-bounces@xxxxxxxxxx] On Behalf Of Johnny Hughes
Sent: Tuesday, February 13, 2007 6:30 AM
To: CentOS ML
Subject: Re:  reboot long uptimes?

On Tue, 2007-02-13 at 12:06 +0100, D Ivago wrote:
Hi,

I was just wondering if I should reboot some servers that have been
running for over 180 days?

They are still stable and have no problems, and top shows no zombie
processes or such, but maybe it's better for the hardware (like ext3
disk checks, for example) to reboot every six months...

About the only other reason I can think of is just to make sure it will
restart when an emergency arises.

For instance, fans, drives, etc.....

Some servers will balk if a fan doesn't run. Some servers balk if a hard
drive isn't up to speed. These types of things only show up during a
reboot. In the case of SCSI RAIDs with hot swap drives, if a drive goes
bad, some equipment will require some action before the boot can
continue and some won't.

For instance, considering RAID5 hot swappable....

If it's one drive on a raid, no biggie.. if it's two and you don't have
hot spares.. that is a bigger issue. 'Scheduled' reboots, like when a
new kernel comes out and you have time to be there and do something or
have someone there if needed... it is a good time to be sure the self
checks done by the server pass.

Basically, the longer the time before reboots, the more likely an error
will occur. And it would be really bad if three or four of your drives
suddenly didn't have enough strength to get up to speed... better that
it is only one which can be easily swapped out.

--

That's not really statistically accurate.

Event X occurring or not occurring has no probable impact on whether
random event Y occurs.

Where X = rebooting, and Y = 'something funky'.

Something funky could happen 5 minutes after the system starts, or 5
years.

-Drew
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos

Drew - I don't think you are correct about those events being independent.

'X' isn't *rebooting* , it is the number of days *between* rebooting.

To define/confirm some of the terms:
"Drive error" - an error which doesn't kill the drive immediately, but lurks until the next bounce.
"Reboot period" - the number of days between each reboot.

If the probability of a drive failing in any one 24h period is 'p', then the probability of a drive failing in a reboot period of 7 days (i.e. for a Windows Vista server ;) ) is approximately 7p, as long as p is small.
If the reboot period is one year, the probability is approximately 365p.

One point Johnny is making (I think!) is that, particularly with RAID arrays, dealing with drive errors one at a time is easier than waiting until there are several.

The point in question: how does a long reboot period contribute to the probability of more than one drive error occurring at any boot event?

Statistically, I believe the following is true.
If (as in the above example) the probability of 1 drive failing is:
p(1 drive failing) = 365p
then, assuming the failures are independent, the probability of 2 drives failing is:
p(2 drives failing) = 365p * 365p
or
days^2 * p^2

Compare this to a 1-day reboot period (i.e. an MS Exchange box?):
p(1 drive failing) = p
p(2 drives failing) = p^2

So the probability of 'problems' (i.e. one drive failing) is linear with respect to reboot period (days times p). The probability of 'disaster' (i.e. two drives failing) is massively higher with long reboot periods: about 133,000 times higher (365^2 = 133,225) for 365 days than for 1 day.
Of course 'p' is a very low number, we hope!
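
For anyone who wants to check the arithmetic, here is a rough Python sketch. The per-day probability `p_day` is a made-up number purely for illustration, and the function names are mine. It compares the "days times p" shortcut with the exact value 1 - (1 - p)^days, which is what the shortcut approximates when p is small, and it assumes drive errors are independent from day to day and from drive to drive.

```python
# Sketch only: p_day below is a hypothetical per-day probability that a
# drive develops a latent error. It is not a measured figure.

def p_latent_error(p_day: float, days: int) -> float:
    """Exact probability that one drive picks up at least one latent
    error during a reboot period of `days` days (independent days)."""
    return 1.0 - (1.0 - p_day) ** days


def p_two_drives(p_day: float, days: int) -> float:
    """Probability that two given drives both have a latent error by
    the next reboot, assuming the drives fail independently."""
    return p_latent_error(p_day, days) ** 2


p_day = 1e-5  # hypothetical value, chosen only to make the numbers readable

for days in (1, 7, 365):
    shortcut = days * p_day               # the "days times p" estimate
    exact = p_latent_error(p_day, days)   # slightly lower than the shortcut
    print(days, shortcut, exact, p_two_drives(p_day, days))

# How much more likely is the two-drive case at 365 days than at 1 day?
ratio = p_two_drives(p_day, 365) / p_two_drives(p_day, 1)
print("two-drive risk ratio, 365 days vs 1 day:", ratio)
```

With a small `p_day` the ratio comes out a little under 365^2 = 133,225, so the shortcut holds up well. For a large `p_day` the shortcut overestimates, since a probability can never exceed 1.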

These are the same calcs as for failure in RAID arrays: as non-intuitive as it may be, more drives in your array means a *greater* risk of a (any) drive failure. However, you can of course mitigate the *effect* of this easily with hot spares.
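
The "more drives, more risk" point can be sketched the same way. This assumes every drive has the same independent probability of failing over some period, and the value of `p_drive` is again hypothetical.

```python
# Sketch only: p_drive is a hypothetical probability that a single drive
# fails during the period of interest, identical for every drive.

def p_any_drive_fails(p_drive: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_drive) ** n_drives


p_drive = 0.01  # hypothetical

for n in (1, 2, 4, 8, 16):
    print(n, "drives:", p_any_drive_fails(p_drive, n))
```

The probability of seeing some failure climbs with every drive added, which is the reason for keeping hot spares around.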

Food-for-thought?
How do we mitigate the effect of multiple failures of this type? Imagine a box that has been running for 10 years. We have to expect that the box will not keep BIOS time across a cold reboot, which is not a problem with NTP. But what about the BIOS on the motherboard, video cards, RAID controller etc.? Can NVRAM be trusted to be 'NV' after 10 years of being hot?
Obviously data is safe because of our meticulous backups???

Regards,

MrKiwi.




