> Bron Gondwana wrote:
>> I assume you mean 500 gigs!  We're switching from 300 to 500 on new
>> filesystems because we have one business customer that's over 150Gb
>> now and we want to keep all their users on the one partition for
>> folder sharing.  We don't do any murder though.
>>
>
> Oops yes, I meant 500 gigs. The potential downside of running an fsck
> on terabyte+ filesystems is not worth the risks IMO. The tremendous
> speed & efficiency of Cyrus is in its small files and the indexes.
> However, you have to keep that in mind when estimating not just
> backups and other daily/weekly items but more serious items.
>
> Really, I've looked at fsck too many times in my life and don't ever
> want to again. Anyone who tells me "oh yes, but journalling solved
> all that long ago..." will get an earful from me about how they
> haven't run a big enough setup with enough stress on it to SEE real
> problems. I have seen both journalled Linux and logged Solaris
> filesystems turn up with data corruption and ended up staring at that
> fsck prompt wondering how many hours until it's done.
>
> The antiquated filesystems that 99% of admins tolerate and work with
> every day should be lumped under some kind of Geneva provision
> against torture. It's a mystery to me why this wasn't resolved years
> ago and why there isn't a big push for it from anyone.
>
> "It doesn't matter how fast it is if it isn't CORRECT!" should be
> some kind of mantra for a production data center, but it still seems
> the majority of my colleagues talk the same as in the 1980s about
> how, if we turn off this or that safety feature, we can make the
> filesystem faster.
>
> OK, stepping off my soapbox now.

What you said is not wrong, but it matters how you look at it. It's
true, looking at an fsck prompt is something very boring and it can
make one very nervous. But after many years of Unix and Linux
experience it doesn't look _so_ bad, considering the issues people
have in the non-*X world.

I have seen much less fsck in the last 10 years than before. I have
seen it with all kinds of traditional Unix filesystems on HP-UX,
Solaris, AIX and SCO Unix, and I have seen it on Linux with ext2, then
with ext3 and reiserfs. Linux with XFS has shown almost no issues
(despite bad support from my main distributor, Red Hat, and some bad
behavior in earlier releases).

What is really bad is if you end up with a broken filesystem which
cannot be fixed anymore: corrupt, dead, disaster. The bad news is that
this seems to be possible with every filesystem, more or less, because
software can have bugs and so it can do something wrong. Features like
checksumming data and metadata are nice but don't protect you from the
worst (see the little sketch at the end of this message).

I really hope we will see something like ZFS on other platforms (I
know there are already implementations on *BSD, but I'm not sure they
are as stable as on Solaris). As it is now, if you choose to use ZFS
you are limited to Solaris, and from my point of view it still greatly
lacks features in other areas. That said, you always have to choose
what is important for you and find a way to work around the
disadvantages of the chosen solution.

What I'm really wondering is: what filesystem disasters have others
seen? How many times was it fsck only, and how many times was the
filesystem really broken? I'm not talking about laptop and desktop
users but about production systems in a production environment with
production-class hardware and operating systems. It would be really
interesting to hear some of the good and bad stories, even if not
directly related to Cyrus IMAP.
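To make the checksumming point concrete, here is a little Python
sketch (toy code only; the names like write_block/read_block are my
own invention, not taken from any real filesystem) of per-block
checksums in the spirit of what ZFS does. It shows why a checksum
catches data that rots on the medium, but is perfectly happy when
buggy software wrote the wrong data in the first place: the checksum
was computed over the bad data and still matches.

    import hashlib

    store = {}  # block number -> (checksum, data); a stand-in for the disk

    def write_block(blkno, data):
        # The checksum is computed over whatever we are told to write;
        # if a buggy program hands us garbage, we faithfully checksum garbage.
        store[blkno] = (hashlib.sha256(data).hexdigest(), data)

    def read_block(blkno):
        checksum, data = store[blkno]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch: block %d is corrupt" % blkno)
        return data

    # Case 1: bit rot on the medium IS detected.
    write_block(0, b"important mailbox index")
    checksum, data = store[0]
    store[0] = (checksum, b"important mailbox indeX")  # medium flips a bit
    try:
        read_block(0)
    except IOError as e:
        print(e)  # checksum mismatch: block 0 is corrupt

    # Case 2: a software bug is NOT detected -- wrong data, valid checksum.
    write_block(1, b"data a buggy program wrote to the wrong file")
    print(read_block(1))  # reads back "clean"; the checksum cannot help here

The real ZFS does better than this toy by keeping the checksum in the
parent block pointer rather than next to the data, which also catches
misdirected writes, but no checksum scheme can catch a bug that writes
logically wrong data through the normal path.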
Regards,
Simon

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html