On Thu, 08 Jan 2009 20:03 -0500, "Dale Ghent" <daleg@xxxxxxxxxxxxx> wrote:
> On Jan 8, 2009, at 7:46 PM, Bron Gondwana wrote:
>
> > We run one zfs machine. I've seen it report issues on a scrub
> > only to not have them on the second scrub. While it looks shiny
> > and great, it's also relatively new.
>
> Wait, weren't you just crowing about ext4? The filesystem that was
> marked GA in the linux kernel release that happened just a few weeks
> ago? You also sound pretty enthusiastic, rather than cautious, when
> talking about btrfs and tux3.

I was saying I find it interesting. I wouldn't seriously consider using
it for production mail stores just yet, but I have been testing it on
my laptop, where I run an offlineimap-replicated copy of my mail. I
wouldn't consider btrfs for production yet either, and tux3 isn't even
on the radar. They're interesting to watch, though, as is ZFS. I also
said (or at least meant) that if you have commercial support, ext4 is
probably going to be the next evolutionary step from ext3.

> ZFS, and anyone who even remotely seriously follows Solaris would know
> this, has been GA for 3 years now. For someone who doesn't have their
> nose buried in Solaris much or with any serious attention span, I
> guess it could still seem new.

Yeah, that's true - but I've heard anecdotes of people losing entire
zpools to bugs. Google turns up things like:

http://www.techcrunch.com/2008/01/15/joyent-suffers-major-downtime-due-to-zfs-bug/

which points to this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=49020&tstart=0

and finally this comment:

http://www.joyeur.com/2008/01/16/strongspace-and-bingodisk-update#c008480

Not something I would want happening to my entire universe, which is
why we spread our email across ~280 separate filesystems (at the
moment): a rare filesystem bug is only likely to affect a single store
if it bites, and we can restore one store's worth of users a lot
quicker than the whole system.

It's the same reason we prefer Cyrus replication (and put a LOT of work
into making it stable - check this mailing list's archives from a
couple of years ago; I wrote most of the patches that stabilised
replication between 2.3.3 and 2.3.8).

If all your files are on a single filesystem, a rare bug only has to
hit once. A frequent bug, on the other hand - well, you'll know about
it pretty fast... :) None of the filesystems mentioned have frequent
bugs (except btrfs and probably tux3 - but they ship with big fat
warnings all over).

> As for your x4500, I can't tell if those syslog lines you pasted were
> from Aug. 2008 or 2007, but certainly since 2007 the marvell SATA
> driver has seen some huge improvements to work around some pretty
> nasty bugs in the marvell chipset. If you still have that x4500, and
> have not applied the current patch for the marvell88sx driver, I
> highly suggest doing so. Problems with that chip are some of the
> reasons Sun switched to the LSI 1068E as the controller in the x4540.

I think it was 2007, actually. We haven't had any trouble with it for a
while, but then it does very little work. The big zpool is just used
for backups, which are pretty much one .tar.gz and one .sqlite3 file
per user - and the .sqlite3 file just indexes the .tar.gz file, so we
can rebuild it by reading the tar file if needed.

As a counterpoint to some of the above, we had an issue with Linux
where there was a bug in the 64-bit writev handling of mmaped memory.
If you did a writev from an mmaped region that crossed a page boundary,
and the following page wasn't mapped in, it would inject spurious zero
bytes into the output where the start of the next page belonged.
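The shape of the trigger was roughly this (a simplified, from-memory
sketch rather than the actual test case we sent upstream; error
checking omitted):

/* Map a two-page file, then writev() a single segment that straddles
 * the page boundary without ever having touched the second page, so
 * it hasn't been faulted into the process yet. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* Source file: one page of 'A's followed by one page of 'B's. */
    int src = open("src.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
    char *fill = malloc(2 * page);
    memset(fill, 'A', page);
    memset(fill + page, 'B', page);
    write(src, fill, 2 * page);

    /* Map it read-only and deliberately never read the second page. */
    char *map = mmap(NULL, 2 * page, PROT_READ, MAP_SHARED, src, 0);

    /* A single iovec segment straddling the page boundary. */
    struct iovec iov = {
        .iov_base = map + page - 100,
        .iov_len  = 200,
    };

    int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    ssize_t n = writev(out, &iov, 1);
    printf("writev wrote %zd bytes\n", n);

    /* Correct result: out.dat holds 100 'A's then 100 'B's.  On the
     * buggy kernels the bytes past the boundary came out as zeros. */
    return 0;
}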
It took me a few days to prove it was the kernel and create a
repeatable test case, and then, after some back and forth with Linus
and a couple of other developers, we had it fixed and tested
_that_day_. I don't know anyone with even unobtainium-level support
from a commercial vendor who has actually had that sort of turnaround.

The bug caused pretty massive file corruption - especially of our
skiplist files, but it hit bits of every other meta file too. Luckily,
as per above, we had only upgraded one machine. We generally do that
with new kernels or software versions - upgrade one production machine
and watch it for a bit. We also test things on testbed machines first,
but you always find something different in production.

The crossing-a-page-boundary case was pretty rare - only a few
instances per day actually caused a crash; the rest were silent
corruption that wasn't detected at the time. If something like this had
hit our only machine, we would have been seriously screwed. Since it
hit just one machine, we could apply the fix and re-replicate all the
damaged data from the other machine. No actual data loss.

Bron.

-- 
  Bron Gondwana
  brong@xxxxxxxxxxx

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html