On 11/06/2013 06:15 AM, Darren Birkett wrote:
Hi,

I understand from various reading and research that there are a number of things to consider when deciding how many disks to put into a single chassis:

1. Higher density means a larger failure domain (more data to re-replicate if you lose a node)
2. More disks means more CPU/memory horsepower to handle the number of OSDs
3. Network becomes a bottleneck with too many OSDs per node
4. ...

We are looking at building high-density nodes for small-scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with 2x external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node, or about 243TB of potential usable space, and so (assuming up to 75% fill) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, otherwise unused 10Gbps network, my back-of-a-beer-mat calculation says it would take about 45 hours to get the cluster back into an undegraded state - that is, back to the requisite number of copies of all objects.
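For reference, a minimal sketch of that beer-mat arithmetic, assuming the full 10Gbps link is dedicated to recovery traffic and ignoring protocol/backfill overhead:

```python
# Back-of-the-envelope recovery time for losing one 90-disk node.
# Assumes the whole 10Gbps link carries recovery traffic only.

disks_per_node = 90
disk_size_tb = 3.0            # 3TB drives
usable_fraction = 0.9         # ~243TB usable out of 270TB raw
fill_fraction = 0.75          # cluster assumed to be 75% full

data_at_risk_tb = disks_per_node * disk_size_tb * usable_fraction * fill_fraction

link_gbps = 10
link_tb_per_hour = link_gbps / 8 * 3600 / 1000   # Gbit/s -> TB/hour (~4.5)

hours = data_at_risk_tb / link_tb_per_hour
print(f"~{data_at_risk_tb:.0f}TB to re-replicate, ~{hours:.0f} hours at {link_gbps}Gbps")
# ~182TB and ~40 hours; with real-world overhead, 'about 45 hours' is in the
# right ballpark.
```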
Basically the recommendation I give is that 5 is the absolute bare minimum number of nodes I'd put in production, but I'd feel a lot better with 10-20 nodes. The setup you are looking at is 90 drives spread across 10U in 1 node, but you could instead use two 36-drive chassis (I'm assuming you are looking at Supermicro) with the integrated motherboard and do 72 drives in 8U. That's the same density, but over double the node count. Further, it requires no external SAS cables, and you can use 4-5 lower-bin processors instead of two very top-bin processors, which gives you more overall CPU power for the OSDs. You can also use cheaper, less dense memory, and you are buying 1 chassis per node instead of 3 (though more nodes overall). Between all of this, you may save enough money that the overall hardware cost may not be that much higher.
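A quick sanity check on that density claim, reusing the disk-size and fill assumptions from the original post and the ~4.5 TB/h effective 10Gbps recovery rate from the earlier sketch:

```python
# Rough comparison of the two layouts: one 90-drive node (2U head + 2x45-disk
# JBODs, 10U total) vs two integrated 36-drive chassis (8U total).

layouts = {
    "1 node, 2U head + 2x45-disk JBODs (10U)": (1, 90, 10),
    "2 nodes, 36-drive chassis (8U)":          (2, 36, 8),
}

disk_tb, usable, fill, tb_per_hour = 3.0, 0.9, 0.75, 4.5

for name, (nodes, drives_per_node, rack_u) in layouts.items():
    density = nodes * drives_per_node / rack_u
    data_per_node_failure = drives_per_node * disk_tb * usable * fill
    print(f"{name}: {density:.0f} drives/U, "
          f"~{data_per_node_failure:.0f}TB to re-replicate per node failure "
          f"(~{data_per_node_failure / tb_per_hour:.0f}h)")
```

Same drives per rack unit either way, but the data put at risk by a single node failure (and the time to heal from it) drops by more than half.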
Taking this even further, options like the Hadoop-style fat twin nodes with 12 drives in 1U could potentially be even denser, while spreading the drives out over even more nodes. Now instead of 4-5 large dense nodes you have maybe 35-40 small dense nodes. The downside is that the cost may be a bit higher, and you have to slide out a whole node to swap drives, though Ceph is more tolerant of this than many distributed systems.
Assuming you can shove in a pair of hex-core hyperthreaded processors, you're probably OK with number 2. If you're already considering 10GbE networking for the storage network, there's probably not much you can do about number 3 unless you want to spend a lot more money (and the reason we're going so dense is to keep this a cheap option). So the main thing would seem to be a real fear of 'losing' so much data in the event of a node failure. Who wants to wait 45 hours (probably much longer, assuming the cluster remains live and has production traffic traversing that network) for the cluster to self-heal? But surely this fear is based on the assumption that in all that time you've not identified and replaced the failed chassis - that you would sit for 2-3 days and just leave the cluster to catch up, rather than actually address the broken node. Given good data centre processes and a good stock of spare parts, isn't it more likely that you'd have replaced that node and got things back up and running in a matter of hours? In all likelihood, a node crash/failure will not have taken out all, or maybe any, of the disks, and a new chassis can just have the JBODs plugged back in and away you go?
You might be able to rig up something like this, but honestly hardware isn't really the expensive part of distributed systems. One of the advantages that Ceph gives you is that it makes it easier to support very large deployments without a ton of maintenance overhead. Paying an extra 10 percent to move away from complicated nodes with external JBODs to simpler nodes is worth it imho.
I'm sure I'm missing some other pieces, but if you're comfortable with your hardware replacement processes, doesn't number 1 become a non-issue, really? I understand that in some ways this goes against the concept of Ceph being self-healing, and that in an ideal world you'd have lots of lower-density nodes to limit your failure domain, but when the decision is driven by cost, isn't this an OK way to look at things? What other glaringly obvious considerations am I missing with this approach?
When hardware cost is the #1 concern, the way I look at it is that there are often one or more sweet spots where it no longer makes sense to try to shove more drives into 1 node if it means having to buy denser memory, top-bin CPUs, exotic controllers, or the very densest drives available.
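As a purely illustrative way of hunting for that sweet spot, you can compare cost per usable TB across candidate configurations. Every price below is a made-up placeholder, not a quote; substitute your own vendor numbers:

```python
# Illustrative only: all prices are hypothetical placeholders. The point is
# the shape of the comparison (cost per usable TB), not the numbers.

def cost_per_usable_tb(chassis, cpus, memory, controllers, drive_price,
                       drive_count, drive_tb, usable_fraction=0.9):
    """Total node hardware cost divided by usable capacity."""
    total = chassis + cpus + memory + controllers + drive_price * drive_count
    return total / (drive_count * drive_tb * usable_fraction)

# Hypothetical dense node: head unit + external JBODs, top-bin CPUs, dense memory.
dense = cost_per_usable_tb(chassis=9000, cpus=4000, memory=2500,
                           controllers=1500, drive_price=150,
                           drive_count=90, drive_tb=3.0)

# Hypothetical simpler node: integrated 36-drive chassis, lower-bin CPUs.
simple = cost_per_usable_tb(chassis=5000, cpus=1500, memory=1000,
                            controllers=0, drive_price=150,
                            drive_count=36, drive_tb=3.0)

print(f"dense:  ~${dense:.0f}/usable TB")
print(f"simple: ~${simple:.0f}/usable TB")
```

Once the denser box stops winning on that metric, the extra failure-domain risk buys you nothing.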
Darren