> On Tue, Sep 29, 2009 at 09:45:53AM +0200, Simon Matter wrote:
>> What I'm really wondering is: what filesystem disasters have others
>> seen? How many times was it fsck only, and how many times was it
>> really broken? I'm not talking about laptop and desktop users but
>> about production systems in a production environment with
>> production-class hardware and operating systems.
>
> Well - we lost three drives in a 3TB RAID6 partition within 24 hours.
> That was sad. The third drive wasn't totally lost, just throwing enough
> errors that we remounted the whole thing readonly and kept it around to
> supplement the backup restores.
>
>> Would be really interesting to get some of the good and bad stories,
>> even if not directly related to Cyrus-IMAP.
>
> Honestly, the biggest thing is - I've got a unit I've just switched
> drives in. It has 4 x 300GB 15kRPM drives in two RAID1 sets, and
> 8 x 2TB drives in two RAID5 sets. That's 12TB of data space plus a bit
> of room for meta.
>
> Those 2TB drives spin at 7200RPM, which is not that fast. It takes
> weeks to fill one of those things, and weeks again to copy the data
> off.
>
> Once you start talking multi-day downtimes to restore data, that's
> when your customers take their business elsewhere, and fair enough.
> OK if you're a university or a business with a captive customer base,
> but not so nice if you're trying to keep customers!

The interesting point is that the discussion started as a ZFS vs.
$ANY_OTHER_FS thing, but it quickly turned out that the filesystem is
only one part of the picture. If your storage fails at the block level,
I doubt the filesystem matters that much.

One of the biggest issues is cheap big drives put together into huge
RAID arrays. There is a good chance that when one disk fails, errors
show up on another disk as well.

What I do with Linux software RAID is split every big disk into smaller
chunks; with a 500G disk, for example, I create 10 x 50G segments on it.
Then I create an independent RAID device over the corresponding segment
of each disk, and those RAID devices are put into LVM volume groups.
That prevents a disk from being kicked out of the RAID completely when
only a small part of the disk is defective. (A rough command sketch is
appended below.) IIRC ZFS does something which has a similar effect in
the end, and AIX SoftRAID does something like that as well.

I'll end here before getting too OT.

Simon
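
P.S. To make the segment-per-RAID layout above concrete, here is a
minimal, untested sketch. It assumes two hypothetical 500G disks
(/dev/sdb and /dev/sdc), RAID1 across corresponding segments, and
made-up volume group and logical volume names (vg_mail, lv_spool);
adjust the segment count, RAID level and device names to your setup.

  # 1. Partition each disk into equal-sized segments (GPT, 50G each).
  parted -s /dev/sdb mklabel gpt
  parted -s /dev/sdb mkpart seg1 1MiB 50GiB
  parted -s /dev/sdb mkpart seg2 50GiB 100GiB
  # ...repeat for the remaining segments, then do the same for /dev/sdc.

  # 2. Build one small RAID1 array per segment pair, mirrored across the
  #    two disks, so a bad spot only degrades the array for that segment.
  mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
  mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdb2 /dev/sdc2
  # ...one md device per segment pair.

  # 3. Aggregate the small arrays back into one pool with LVM.
  pvcreate /dev/md10 /dev/md11
  vgcreate vg_mail /dev/md10 /dev/md11
  lvcreate -n lv_spool -l 100%FREE vg_mail

The point of step 3 is that LVM, not the RAID layer, joins the segments
back into one big device, so losing redundancy in one segment does not
cost you redundancy across the whole disk.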