On Mon, Sep 28, 2009 at 03:33:44PM -0700, Vincent Fox wrote:
> Bron Gondwana wrote:
> > I assume you mean 500 gigs!  We're switching from 300 to 500 on new
> > filesystems because we have one business customer that's over
> > 150Gb now and we want to keep all their users on the one partition
> > for folder sharing.  We don't do any murder though.
>
> Oops yes.  I meant 500 gigs.  The potential downside of
> running an fsck on terabyte+ filesystems is not worth
> the risks IMO.  The tremendous speed & efficiency of
> Cyrus is in its small files and the indexes.  However you
> have to keep that in mind when estimating not just backups
> and other daily/weekly items but more serious items.

For sure.

> Really I've looked at fsck too many times in my life and
> don't ever want to again.  Anyone who tells me "oh yes but
> journalling solved all that long ago...." will get an earful
> from me about how they haven't run a big enough setup
> with enough stress on it to SEE real problems.  I have seen
> both journalled Linux and logged Solaris filesystems turn up
> with data corruption and ended up staring at that fsck
> prompt wondering how many hours until it's done.....

Yep.  Which is why we treat filesystems as disposable :)  There are
multiple real-time replicated copies of anything we care about, so we
can blow away a filesystem and just recreate it.  Even after a
successful fsck I might just decide it's cheaper to recreate it than
run a full sha1 checking audit_slot on the contents!

> The antiquated filesystems that 99% of admins tolerate and
> work with every day should be lumped under some kind of
> Geneva provision against torture.  It's a mystery to me why
> it wasn't resolved years ago and why there isn't a big push
> for it from anyone.

Patents I suspect, at least partially.

> "It doesn't matter how fast it is, if it isn't CORRECT!"
> should be some kind of mantra for a production data center but it
> still seems the majority of my colleagues talk the same as in the
> 1980s about how if we turn off this or that safety feature we can
> make the filesystem faster.

Everything's a tradeoff, hey.  With enough checksums and replication,
I'm willing to treat every layer as less than 100% reliable, because
that's reality.

I haven't heard too many horror stories of ZFS recently, but we
certainly hit a bug where we needed a software update before we could
replace a failed disk, because ZFS refused to consider anything
plugged into the same controller again, even after a reboot.  That
was odd.

> OK stepping off my soapbox now.

It's an interesting one.  For real reliability, I want to have
multiple replication targets supported cleanly.  It's not even that
hard.  Basically you would chain sync_client instances, such that
there was an initial task that just reads $conf/sync/log and appends
the contents to both $conf/sync/stream1/log and
$conf/sync/stream2/log, then a separate sync_client instance that
operates in each of $conf/sync/stream1 and $conf/sync/stream2,
replicating to separate backends.

This would involve minimal code changes I suspect, and allow a
replica to be offline while the other two are up-to-date, and still
know what needed syncing when you turned it back on!

Then we'd be able to bring up a new replica BEFORE removing the old
one.  It's like RAID1 with three disks :)  Add a new one, remove the
old.  Always 2 up-to-date copies.

Then add management tools to make that easy to start and stop!  It's
an ongoing task to improve reliability.

I actually wonder if it's possible to have multiple Cyrus instances
running in a mesh.  Each one running a sync_server and with
sync_client instances running on every other one.  In THEORY so long
as you only wrote to one at any one time you could read from any of
them, or even if you only had connections for a single user happening
to one at any one time you'd be OK.
You could hash users amongst them to balance the load.

Then - well, I already have checksums coded into index files, just
awaiting code review from Ken to push that upstream.  Along with
sha1s, that's 99% of the data covered by checksums.  Flat files
(quota and the like) I don't think are viable, but it might be
possible to add checksums to skiplist as well, at the expense of a
format change.  Not sure about BDB.  I'm not a giant fan of it
anyway - at least how it's being used in Cyrus.  All our DBs are
skiplist now, and we're pretty happy with it :)

Bron.
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
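
As an illustration of the sync log fan-out idea in the message above (one
initial task reading $conf/sync/log and appending its contents to each
stream's log), here is a rough sketch in Python.  It is not Cyrus code and
not the author's implementation: the function name `fan_out_sync_log`, the
rename-before-read step, and the `log-<pid>` working file name are all
assumptions made for the example, and it does no locking or crash recovery.

```python
import os
from pathlib import Path


def fan_out_sync_log(conf_dir, streams=("stream1", "stream2")):
    """Append pending entries from <conf_dir>/sync/log to each stream's log.

    Hypothetical helper for illustration only.  Returns the number of
    newline-terminated entries that were fanned out.
    """
    sync_dir = Path(conf_dir) / "sync"
    source = sync_dir / "log"
    if not source.exists():
        return 0  # nothing pending

    # Assumption: move the log aside first, so entries written while we
    # are copying land in a fresh log file and are picked up next time.
    work = sync_dir / f"log-{os.getpid()}"
    os.rename(source, work)

    data = work.read_bytes()
    for name in streams:
        stream_dir = sync_dir / name
        stream_dir.mkdir(parents=True, exist_ok=True)
        # Append rather than overwrite: a stream whose sync_client is
        # offline simply keeps accumulating work until it comes back.
        with open(stream_dir / "log", "ab") as out:
            out.write(data)
            out.flush()
            os.fsync(out.fileno())

    work.unlink()
    return data.count(b"\n")
```

Each per-stream sync_client would then be pointed at its own stream
directory.  Because the per-stream logs are only ever appended to here, one
replica can be down for a while without the other stream being affected,
which is the property the message is after.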