On Wed, Sep 15, 2010 at 09:15:13AM +0530, Shuvam Misra wrote:
> Dear Bron,
>
> > http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413
> >
> > 2TB - US $109.
>
> Don't want to nit-pick here, but the effective price we pay is about
> ten times this.

Yeah, so? It's going down. That's a large number of attachments we're
talking about there.

> To set up a mail server with a few TB of disk space,
> we usually land up deploying a separate chassis with RAID controllers and
> a RAID array, with FC connections from servers, etc, etc. All this adds
> up to about $1,000/TB of usable space if you're using something like the
> "low-end" IBM DS3400 box or Dell/EMC equivalent. This is even with
> inexpensive 7200RPM SATA-II drives, not 15KRPM SAS drives.

Hmm... our storage units with metadata on SSD come in at about $1200/TB.
Yes, that sounds about right. That's including hot spares, RAID1 on
everything (including the SSDs), and scads of processor and memory.
Obviously multiply that by two for replication, add a bit extra for
backups, and I'm happy to arrive at a figure of approximately $3000 per
terabyte of actual email.

> And most of our customers actually double this cost because they keep two
> physically identical chassis for redundancy. (We recommend this too,
> because we can't trust a single RAID 5 array to withstand controller or
> PSU failures.) In that case, it's $2000/TB.

And because it's nice not to have downtime when you're doing maintenance.
I replaced an entire drive unit today, including about 4 hours of downtime
on one of our servers while the system was swamped with IO creating new
filesystems and initialising the drives. The users didn't see a thing, and
replication is now fully operational again.

> And you do reach 5-10 TB of email store quite rapidly --- our company
> has many corporate clients (< 500 email users) whose IMAP store has
> reached 4TB. No one wants to enforce disk quotas (corporate policy),
> and most users don't want to delete emails on their own.

So you save, what, 50%? Does that sound about right? Do you have
statistics on how much space you'd save with this theoretical patch?

> We keep hearing the logic that storage is cheap, and stories of cloud
> storage through Amazon, unlimited mailboxes on Gmail, are reinforcing
> the belief. But at the ground level in mid-market corporate IT budgets,
> storage costs in data centres (as against inside desktops) are still
> too high to be trivial, and their prices have only little to do with
> the prices of raw SATA-II drives. A fully-loaded DS3400 costs a little
> over $12,000 in India, with a full set of 1TB SATA-II drives from IBM,
> but even with the high cost of IBM drives, the drives themselves
> contribute less than 30% of the total cost.

You're buying a few months. Usage grows to fill the available storage,
whatever it is. And you can only pull this piece of magic once.

> If we really want to put our collective money where our mouth is, and
> deliver the storage-is-cheap promise at the ground level, we need to
> rearchitect every file server and IMAP server to work in map-reduce mode
> and use disks inside desktops. Anyone game for this project? :)

You could buy as much benefit much more quickly by gzipping the individual
email files: either a filesystem that stores files compressed, or a Cyrus
patch to do the same and unpack files on the fly when the body is read.
With most/all headers in the cyrus.cache file, the body doesn't get opened
very often. Man, a body search would hurt though!
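Roughly the sort of thing I mean, as a minimal Python sketch -- the spool
paths and the ".gz" suffix convention here are just illustrative
assumptions, not the real Cyrus spool format or a proposed patch:

    #!/usr/bin/env python3
    """Minimal sketch: store each message gzipped, and unpack it on the
    fly only when the body is actually read.  The ".gz" suffix and path
    handling are illustrative assumptions, not the Cyrus spool layout."""

    import gzip
    import os
    import shutil


    def compress_message(path):
        """Replace a stored message file with a gzipped copy of itself."""
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.unlink(path)
        return gz_path


    def open_message(path):
        """Return a readable file object for a message, decompressing
        transparently if only the gzipped copy exists."""
        if os.path.exists(path):
            return open(path, "rb")            # plain spool file still there
        return gzip.open(path + ".gz", "rb")   # unpack on the fly

Whole-file gzip keeps the read path trivial; the pain is exactly the one
mentioned above -- a body search has to decompress every message it touches.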
> > Now de-duping messages on copy is valuable, not so much because of
> > the space it saves, but because of the IO it saves. Copying the file
> > around is expensive.
> >
> > De-duping components of messages and then reconstructing? Not so much.
> > You'll be causing MORE IO in general looking for the message, finding the
> > parts.
>
> I agree. My aim was not to reduce IOPS but to cut disk space usage.

IOPS matter too, depending on your usage patterns obviously. If you don't
ever get body searches on your server they probably matter less.

> A 500-user company can easily acquire an email archive of 2-5TB. I don't
> care how much the IO load of that archive server increases, but I'd like
> to reduce disk space utilisation. If the customer can stick to 2TB of
> space requirements, he can use a desktop with two 2TB drives in RAID
> 1, and get a real cheap archive server. If this figure reaches 3-4TB,
> he goes into a separate RAID chassis --- the hardware cost goes up 5-10
> times. These are tradeoffs a lot of small to mid-sized companies in my
> market fuss about.

Sounds like a case for a cheaper RAID chassis to me. Or actually cleaning
up a little. While I appreciate the tradeoff, I think they'll still fill
up pretty quickly even with this. It's a short-term stop-gap measure.

> And in a more generic context, I am seeing that all kinds of intelligent
> de-duping of infrequently-accessed data is going to become the crying
> need of every mid-sized and large company. Data is growing too fast,
> and no one wants to impose user discipline or data cleaning. When we
> tell the business head "This is crazy!", he turns around and tells the
> CTO "But disk space is cheap! Haven't you heard of Google? What are you
> cribbing about? You must be doing something really inefficient here,
> wasting money!"

Well... yes. It would be lovely. We're in the realm of binary diffs here
to really get efficient. I appreciate the goals, but I'm not sold on the
effectiveness. Mind you, if you wrote a robust and efficient system that
did it, I'd use it! I'm just not convinced that the work to get there will
pay for itself in actual value.

Bron.

----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
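Going back to the de-dup-on-copy point quoted at the top of this message:
on a single partition, most of that IO saving can be had with a hard link
instead of a full file copy. A rough sketch, purely illustrative -- the
paths are made up, and this is not a description of how Cyrus actually
implements COPY internally:

    #!/usr/bin/env python3
    """Rough sketch: de-duplicate a message COPY with a hard link so the
    body is never re-read or re-written.  Paths are made-up examples; this
    is not a claim about Cyrus's actual COPY implementation."""

    import os
    import shutil


    def copy_message(src, dst):
        """Create dst as a hard link to src when possible (same data, a
        new directory entry, no body IO); fall back to a real copy."""
        try:
            os.link(src, dst)
        except OSError:            # cross-device link, unsupported FS, ...
            shutil.copy2(src, dst)

    # Hypothetical call, spool paths invented for the example:
    # copy_message("/var/spool/imap/user/bob/INBOX/123.",
    #              "/var/spool/imap/user/bob/Archive/45.")

The win is exactly what the quote says: no body IO on the copy; the space
saving is a side effect.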