Simon Matter wrote:
What I'm really wondering, what filesystem disasters have others seen? How many times was it fsck only, how many times was it really broken. I'm not talking about laptop and desktop users but about production systems in a production environment with production class hardware and operating systems. Would be really interesting to get some of the good and bad stories even if not directly related to Cyrus-IMAP.
So we ran UFS (with logging) on multiple UW-IMAP backends before moving to Cyrus. I can tell you that at LEAST half a dozen times we had some hardware or software crash that left someone looking at this: fsck /var/mail Y/N? The "correct" answer is Y, but then you have hours and hours of downtime, so sometimes you say N and cross your fingers. We had one system where someone hit N and left it that way for weeks, not knowing whether it was going to develop cancer at any moment, until we could migrate users off it. It seemed to be working OK, but we had no way to verify that while "hot", and no downtime was available in the intervening period, so we crossed our fingers.

Since I started working here at UC Davis in 2005, I've seen double-disk failures in a RAID-5 set THREE TIMES, when I had never seen one in the previous 15 years. I've seen dual-controller RAID arrays go into total lockup when one controller failed and the code that was supposed to fail over smoothly to the other controller didn't work. What's going on inside that black-box array controller? Who knows. The original developer is long gone, and the replacements who have upgraded it over the years don't really know how it all works. It often astonishes me that Linux admins will use hardware controllers and even EMC SANs for quite large datasets and blindly trust the black box.

RAID6? I am a member of BAARF. RAID5/6 are not to be trusted. See http://www.baarf.com/

So yes, I'm the paranoid soul who, if you hand me RAID6 LUNs from an EMC SAN, will ZFS-mirror them together for additional safety on top, since I know from experience that I cannot trust the black boxes to do what they claim. Really, I'm not trying to beat anyone over the head with ZFS in particular; I'm just saying that right now it's the only filesystem I can use in production for large datasets that I actually TRUST. I very much like being able to run "zpool scrub" once in a while, say after replacing a disk, even during peak usage hours, and KNOW it's all correct. When Linux has something similar I'll use it in a second. Until then I prefer Linux for app servers and Solaris for back-end storage. YMMV.
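For what it's worth, here is roughly what that setup looks like, as a sketch only: the pool name and the Solaris-style LUN device names (c2t0d0, c3t0d0) are made up for illustration, so substitute whatever your SAN actually presents.

    # Mirror two SAN-provided RAID6 LUNs into a single ZFS pool,
    # so ZFS checksums catch anything the array silently corrupts.
    zpool create mailpool mirror c2t0d0 c3t0d0

    # Later, e.g. after a disk or LUN replacement, walk every block
    # end to end and verify checksums -- safe to run while live.
    zpool scrub mailpool

    # Watch scrub progress and see any checksum errors found/repaired.
    zpool status -v mailpool

The point of layering the mirror on top of the array's own RAID6 is simply that ZFS verifies its own checksums on every read and on every scrub, so you get an independent end-to-end check instead of having to take the array's word for it.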