Re: Fwd: Re: SSDD reliability

Toby Corkindale <toby.corkindale@xxxxxxxxxxxxxxxxxxxx> · Thu, 19 May 2011 11:10:12 +1000

On 19/05/11 10:50, mark wrote:
Note 1:
I have seen an array that had been powered on continuously for about six
years lose half its disks when it was finally powered down, left to cool
for a few hours, then started up again.

Recently we rebooted about 6 machines that had uptimes of 950+ days.
The last time fsck had run on the file systems was 2006.

When stuff gets that old and has been on-line and under heavy load all that
time, you actually get paranoid about reboots. In my newly reaffirmed
opinion, at that stage reboots are at best a crap shoot. That gamble cost us
several hours more than we had budgeted for. HP is getting more of
their gear back than in a usual month.

I worked at one place, years ago, which had an odd policy: they had
automated hard resets hit all their servers on a Friday night, every week.
I thought they were mad at the time!

But it does mean that people design and test the systems so that they
can survive unattended resets reliably. (No one wants to get a support
call at 11pm on Friday because their server didn't come back up.)

It still seems a bit messed up though. Even if Friday night is a
low-use period, it still means causing a small amount of disruption to
customers, especially if a developer or sysadmin messed up and a
server *doesn't* come back up.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)