Hello,

I'm leaving the original post below for your perusal, but will top-post here with the conclusions/decisions so far.

Firstly, I got a quotation from our vendors here (assuming a slightly incorrect 100:1 exchange rate of Japanese Yen to USD). A storage node as described below, with 24 3TB disks, 2 Intel DC S3700 SSDs, a mobo with two Opteron 4300 CPUs (12 cores total), 32GB RAM and one dual-port Infiniband HCA, will cost about $7700. The Areca 1882 24-port controller with 4GB cache is another $1900 (I'm sure it would be cheaper in the US, like all the other gear). The above times two for about 60TB capacity total.

To get the same storage space and (roughly) the same reliability relying on Ceph replication and rebalancing with one OSD per disk would require an additional storage node at the aforementioned $7700, plus $700 per node for HBAs that I trust (and that have the right connectors for the backplane), namely an LSI 16-port and an 8-port one (I looked this up on Google, so these are real dollars and thus certainly more expensive in Japan, but let's just go with that number). In all likelihood, with 24 OSDs per node, larger SSDs would be in order, too. Which winds up being $6000 more expensive than the approach with the insanely expensive RAID controllers: 2 x ($7700 + $1900) = $19,200 versus 3 x ($7700 + $700) = $25,200.

I fully acknowledge the point about spindle contention in my design that Mike Dawson brought up, however I'm confident that between enabling RBD caching, OSD journals on SSDs and the 4GB controller cache this won't be as crippling as one might think.

I'm simply not convinced that the additional $6000, 4U of rack space, having to deal with failed OSDs on a frequent basis and requiring people with at least half a clue to replace disks are worth the additional IOPS in my case.

Further comments are of course still welcome!

Christian

On Tue, 17 Dec 2013 16:44:29 +0900 Christian Balzer wrote:

>
> Hello,
>
> I've been doing a lot of reading and am looking at the following design for a storage cluster based on Ceph. I will address all the likely knee-jerk reactions and the reasoning below, so hold your guns until you've read it all. I also have a number of questions I've not yet found the answer to or determined by experimentation.
>
> Hardware:
> 2x 4U (can you say Supermicro? ^.^) servers with 24 3.5" hotswap bays, 2 internal OS (journal?) drives, probably Opteron 4300 CPUs (see below), an Areca 1882 controller with 4GB cache, and 2 or 3 2-port Infiniband HCAs.
> 24 3TB HDs (30% of the price of a 4TB one!) in one or two RAID6s, 2 of them hotspares, giving us 60TB per node and thus, with a replication factor of 2, that's also the usable space.
> Space for 2 more identical servers if need be.
>
> Network:
> Infiniband QDR, 2x 18-port switches (interconnected of course), redundant paths everywhere, including to the clients (compute nodes).
>
> Ceph configuration:
> An additional server with a mon, mons also on the 2 storage nodes, at least 2 OSDs per node (see below).
>
> This is for a private cloud with about 500 VMs at most. There will be 2 types of VMs, the majority writing a small amount of log chatter to their volumes, the other type (a few dozen) writing a more substantial data stream.
> I estimate less than 100MB/s of reads/writes at full build-out, which should be well within the abilities of this setup.
>
>
> Now for the rationale of this design that goes contrary to what normal Ceph layouts suggest:
>
> 1. Idiot (aka NOC monkey) proof hotswap of disks.
> This will be deployed in a remote data center, meaning that qualified people will not be available locally and thus would have to travel there each time a disk or two fails.
> In short, telling somebody to pull the disk tray with the red flashing LED and put a new one from the spare pile in there is a lot more likely to result in success than telling them to pull the 3rd row, 4th column disk in server 2. ^o^
>
> 2. Density, TCO
> Ideally I would love to deploy something like this:
> http://www.mbx.com/60-drive-4u-storage-server/
> but they seem to not really have a complete product description, price list, etc. ^o^ With a monster like that, I'd be willing to reconsider local RAIDs and just overspec things in a way that a LOT of disks can fail before somebody (with a clue) needs to visit that DC.
> However, failing that, the typical approach of using many smaller servers for OSDs increases the costs and/or reduces density. Replacing the 4U servers with 2U ones (that hold 12 disks) would require some sort of controller (to satisfy my #1 requirement) and a similar number of HCAs per node, clearly driving the TCO up. 1U servers with typically 4 disks would be even worse.
>
> 3. Increased reliability/stability
> Failure of a single disk has no impact on the whole cluster and needs no CPU/network intensive rebalancing.
>
>
> Questions/remarks:
>
> Given that there will be redundancy and reliability at the disk level, and that there will be only 2 storage nodes initially, I'm planning to disable rebalancing.
> Or will Ceph realize that making replicas on the same server won't really save the day and refrain from doing so?
> If more nodes are added later, I will likely set an appropriate full ratio and activate rebalancing on a permanent basis again (except for planned maintenance of course).
> My experience tells me that an actual node failure will be due to:
> 1. Software bugs, kernel or otherwise.
> 2. Marginal hardware (CPU/memory/mainboard hairline cracks, I've seen it all).
> Actual total loss of power in the DC doesn't worry me, because if that happens I'm likely under a ton of rubble, this being Japan. ^_^
>
> Given that a RAID6 with just 7 disks connected to an Areca 1882 controller in a different cluster I'm running here gives me about 800MB/s writes and 1GB/s reads, I have a feeling that putting the journal on SSDs (Intel DC S3700) would be a waste, if not outright harmful.
> But I guess I shall determine that by testing; maybe the higher IOPS rate will still be beneficial.
> Since the expected performance of this RAID will be at least double the bandwidth available on a single IB interface, I'm thinking of splitting it in half and having an OSD for each half, each bound to a different interface. One hopes that nothing in the OSD design stops it from dealing with these speeds/bandwidths.
>
> The plan is to use Ceph only for RBD, so would "filestore xattr use omap" really be needed in case tests determine ext4 to be faster than xfs in my setup?
>
> Given the above configuration, I'm wondering how many CPU cores would be sufficient in the storage nodes.
> Somewhere in the documentation
> http://ceph.com/docs/master/start/hardware-recommendations/
> there is a recommendation for 1GB RAM per 1TB of storage, but later on the same page we see a storage server example with 36TB and 16GB RAM.
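>
> For reference, here is a minimal ceph.conf sketch of the knobs touched on above (replica placement across hosts, the full ratios, the omap xattrs, and the public/cluster split I get to further down); the values are illustrative assumptions on my part, not tested recommendations:
>
>   [global]
>   # The default CRUSH rule places replicas on different hosts
>   # (bucket type 1 = host, 0 = osd), so even with just 2 nodes
>   # both copies should not end up on the same server.
>   osd crush chooseleaf type = 1
>
>   # Warn at 85% full, stop accepting writes at 95%.
>   mon osd nearfull ratio = .85
>   mon osd full ratio = .95
>
>   # Addresses are made up, just showing the public/cluster split.
>   public network  = 192.168.0.0/24
>   cluster network = 10.10.10.0/24
>
>   [osd]
>   # Reportedly only needed for ext4 and its small xattr limit.
>   filestore xattr use omap = true
>
> And for planned maintenance, "ceph osd set noout" (plus "ceph osd unset noout" afterwards) should keep the cluster from rebalancing while a node is deliberately down.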
> Ideally I would love to use just one 6 or 8 core Opteron 4300 with 32GB of memory, thus having only one NUMA domain and keeping all the processes dealing with I/O and interrupts (and there will be lots of them) on the same CPU. (I very much agree with Kyle Bader's mail about NUMA last week, I've seen this happen myself.)
>
> According to this:
> http://ceph.com/docs/master/rados/configuration/network-config-ref/
> a monitor doesn't need to be on the cluster network, only the OSDs do. However, my initial tests seemed to tell me otherwise. If the mons don't need to be on the cluster network, I'd consider a direct IB interconnect between the 2 storage nodes for the initial deployment.
> This actually brings me to my next question: should the cluster network fail, would the OSDs still continue to function and use the public network instead?
>
> I hope this wasn't tl;dr. ^o^
>
> Regards,
>
> Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/