Hi,
I understand from various reading and research that there are a number of things to consider when deciding how many disks one wants to put into a single chassis:
1. Higher density means a larger failure domain (more data to re-replicate if you lose a node)
2. More disks means more CPU/memory horsepower is needed to handle that many OSDs
3. The network becomes a bottleneck with too many OSDs per node
4. ...
We are looking at building high-density nodes for small-scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with two external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node - about 243TB of potential usable space, and so (assuming the cluster is up to 75% full) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, otherwise unused 10Gbps network, my back-of-a-beer-mat calculation says it would take about 45 hours to get the cluster back into an undegraded state - that is, back to the requisite number of copies of all objects.
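For what it's worth, here is the beer-mat sum written out as a small Python sketch. The ~90% raw-to-usable factor is implied by the 243TB figure above, I'm loosely treating TB as TiB, and the link is assumed to be fully dedicated to recovery with no other bottleneck, which it won't be in real life:

    # Back-of-a-beer-mat estimate: time to re-replicate one node's data
    # over a single 10Gbps link, assuming recovery runs at full line rate.
    TIB = 2**40                    # bytes per TiB

    disks_per_node  = 90           # 2 x 45-disk JBODs
    disk_size_tib   = 3            # 3TB drives, loosely treated as TiB
    usable_fraction = 0.9          # rough raw-to-usable allowance (270TB -> ~243TB)
    fill_fraction   = 0.75         # cluster up to 75% full
    link_gbps       = 10           # dedicated, uncongested 10GbE

    bytes_to_move = disks_per_node * disk_size_tib * TIB * usable_fraction * fill_fraction
    seconds = bytes_to_move * 8 / (link_gbps * 1e9)

    print("~%.0f TiB to move, ~%.0f hours at line rate"
          % (bytes_to_move / TIB, seconds / 3600))
    # -> roughly 182 TiB and ~44 hours, in the same ballpark as the figure above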
Assuming you can shove in a pair of hex-core hyperthreaded processors, you're probably OK on number 2. If you're already considering 10GbE for the storage network, there's probably not much you can do about number 3 unless you want to spend a lot more money (and the reason we're going so dense is to keep this a cheap option). So the main thing would seem to be a real fear of 'losing' so much data in the event of a node failure. Who wants to wait 45 hours (probably much longer, assuming the cluster remains live with production traffic traversing that network) for the cluster to self-heal?
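To put a rough number on point 3 (and the per-spindle figure here is an assumed round number, not a measurement): if each 3TB SATA disk can stream somewhere around 100MB/s, the node's spindles can collectively move far more data than a single 10GbE link can carry, so the network stays the bottleneck however dense you go:

    # Aggregate spindle bandwidth vs. one 10GbE link.
    # The 100 MB/s per-disk streaming rate is an assumed round number.
    disks        = 90
    mb_per_disk  = 100.0                  # assumed sequential MB/s per spindle
    spindle_mb_s = disks * mb_per_disk    # aggregate raw disk bandwidth
    link_mb_s    = 10e9 / 8 / 1e6         # 10Gbps expressed in MB/s, ignoring overheads

    print("disks ~%.0f MB/s vs network ~%.0f MB/s (%.0fx oversubscribed)"
          % (spindle_mb_s, link_mb_s, spindle_mb_s / link_mb_s))
    # -> ~9000 MB/s of spindles behind ~1250 MB/s of network, roughly 7x over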
But surely this fear is based on the assumption that in all that time you've not identified and replaced the failed chassis - that you would sit for 2-3 days, leave the cluster to catch up, and not actually address the broken node. Given good data centre processes and a good stock of spare parts, isn't it more likely that you'd have replaced that node and got things back up and running in a matter of hours? In all likelihood a node crash/failure won't have taken out all, or perhaps any, of the disks, so a new chassis can just have the JBODs plugged back in and away you go?
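(As I understand the docs, this is also what the noout flag is for: running 'ceph osd set noout' while you swap the chassis stops the monitors marking those OSDs out and kicking off the full re-replication, and you 'ceph osd unset noout' once the JBODs are back; the 'mon osd down out interval' setting governs how long the cluster waits before doing it automatically. Happy to be corrected if I've misread that.)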
I'm sure I'm missing some other pieces, but if you're comfortable with your hardware replacement processes, doesn't number 1 become a non-issue really? I understand that in some ways it goes against the concept of Ceph being self-healing, and that in an ideal world you'd have lots of lower-density nodes to limit your failure domain, but when cost is the driver isn't this an OK way to look at things? What other glaringly obvious considerations am I missing with this approach?
Darren