On 11/06/2013 06:15 AM, Darren Birkett wrote:
Hi,

I understand from various reading and research that there are a number of things to consider when deciding how many disks to put into a single chassis:

1. Higher density means a larger failure domain (more data to re-replicate if you lose a node)
2. More disks means more CPU/memory horsepower to handle the number of OSDs
3. Network becomes a bottleneck with too many OSDs per node
4. ...

We are looking at building high-density nodes for small-scale 'starter' deployments for our customers (maybe 4 or 5 nodes). High density in this case could mean a 2U chassis with 2x external 45-disk JBOD enclosures attached. That's 90 3TB disks/OSDs to be managed by a single node, or about 243TB of potential usable space, and so (assuming up to 75% fill) maybe 182TB of potential data 'loss' in the event of a node failure. On an uncongested, otherwise unused 10Gbps network, my back-of-a-beer-mat calculation says it would take about 45 hours to get the cluster back into an undegraded state - that is, back to the requisite number of copies of all objects.
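For reference, a minimal sketch of that beer-mat arithmetic, assuming the full 10Gbps link is dedicated to recovery traffic and ignoring protocol/backfill overhead:

```python
# Back-of-the-envelope recovery time for losing one 90-disk node.
# Assumes the whole 10Gbps link carries recovery traffic only.

disks_per_node = 90
disk_size_tb = 3.0            # 3TB drives
usable_fraction = 0.9         # ~243TB usable out of 270TB raw
fill_fraction = 0.75          # cluster assumed to be 75% full

data_at_risk_tb = disks_per_node * disk_size_tb * usable_fraction * fill_fraction

link_gbps = 10
link_tb_per_hour = link_gbps / 8 * 3600 / 1000   # Gbit/s -> TB/hour (~4.5)

hours = data_at_risk_tb / link_tb_per_hour
print(f"~{data_at_risk_tb:.0f}TB to re-replicate, ~{hours:.0f} hours at {link_gbps}Gbps")
# ~182TB and ~40 hours; with real-world overhead, 'about 45 hours' is in the
# right ballpark.
```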
Basically the recommendation I give is that 5 is the absolute bare minimum number of nodes I'd put in production, but I'd feel a lot better with 10-20 nodes. The setup you are looking at is 90 drives spread across 10U in 1 node, but you could instead use two 36-drive chassis (I'm assuming you are looking at Supermicro) with the integrated motherboard and do 72 drives in 8U. That's the same density, but over double the node count. Further, it requires no external SAS cables, and you can use 4-5 lower-bin processors instead of two very top-bin processors, which gives you more overall CPU power for the OSDs. You can also use cheaper, less dense memory, and you are buying 1 chassis per node instead of 3 (though more nodes overall). Between all of this, you may save enough money that the overall hardware cost may not be that much higher.
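A quick sanity check on that density claim, reusing the disk-size and fill assumptions from the original post and the ~4.5 TB/h effective 10Gbps recovery rate from the earlier sketch:

```python
# Rough comparison of the two layouts: one 90-drive node (2U head + 2x45-disk
# JBODs, 10U total) vs two integrated 36-drive chassis (8U total).

layouts = {
    "1 node, 2U head + 2x45-disk JBODs (10U)": (1, 90, 10),
    "2 nodes, 36-drive chassis (8U)":          (2, 36, 8),
}

disk_tb, usable, fill, tb_per_hour = 3.0, 0.9, 0.75, 4.5

for name, (nodes, drives_per_node, rack_u) in layouts.items():
    density = nodes * drives_per_node / rack_u
    data_per_node_failure = drives_per_node * disk_tb * usable * fill
    print(f"{name}: {density:.0f} drives/U, "
          f"~{data_per_node_failure:.0f}TB to re-replicate per node failure "
          f"(~{data_per_node_failure / tb_per_hour:.0f}h)")
```

Same drives per rack unit either way, but the data put at risk by a single node failure (and the time to heal from it) drops by more than half.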
Taking this even further, options like the Hadoop-style fat twin nodes with 12 drives in 1U could potentially be even denser, while spreading the drives out over even more nodes. Now instead of 4-5 large dense nodes you have maybe 35-40 small dense nodes. The downside is that the cost may be a bit higher, and you have to slide out a whole node to swap drives, though Ceph is more tolerant of this than many distributed systems.
Assuming you can shove in a pair of hex-core hyperthreaded processors, you're probably OK with number 2. If you're already considering 10GbE networking for the storage network, there's probably not much you can do about number 3 unless you want to spend a lot more money (and the reason we're going so dense is to keep this a cheap option). So the main thing would seem to be a real fear of 'losing' so much data in the event of a node failure. Who wants to wait 45 hours (probably much longer, assuming the cluster remains live and has production traffic traversing that network) for the cluster to self-heal? But surely this fear is based on the assumption that in all that time you've not identified and replaced the failed chassis - that you would sit for 2-3 days and just leave the cluster to catch up, rather than actually address the broken node. Given good data centre processes and a good stock of spare parts, isn't it more likely that you'd have replaced that node and got things back up and running in a matter of hours? In all likelihood, a node crash/failure will not have taken out all, or maybe any, of the disks, and a new chassis can just have the JBODs plugged back in and away you go?
You might be able to rig up something like this, but honestly hardware isn't really the expensive part of distributed systems. One of the advantages that Ceph gives you is that it makes it easier to support very large deployments without a ton of maintenance overhead. Paying an extra 10 percent to move away from complicated nodes with external JBODs to simpler nodes is worth it imho.
I'm sure I'm missing some other pieces, but if you're comfortable with your hardware replacement processes, doesn't number 1 become a non-issue, really? I understand that in some ways this goes against the concept of Ceph being self-healing, and that in an ideal world you'd have lots of lower-density nodes to limit your failure domain, but when the decision is driven by cost, isn't this an OK way to look at things? What other glaringly obvious considerations am I missing with this approach?
When hardware cost is the #1 concern, the way I look at it is that there are often one or more sweet spots where it no longer makes sense to try to shove more drives into 1 node if it means having to buy denser memory, top-bin CPUs, exotic controllers, or the very densest drives available.
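As a purely illustrative way of hunting for that sweet spot, you can compare cost per usable TB across candidate configurations. Every price below is a made-up placeholder, not a quote; substitute your own vendor numbers:

```python
# Illustrative only: all prices are hypothetical placeholders. The point is
# the shape of the comparison (cost per usable TB), not the numbers.

def cost_per_usable_tb(chassis, cpus, memory, controllers, drive_price,
                       drive_count, drive_tb, usable_fraction=0.9):
    """Total node hardware cost divided by usable capacity."""
    total = chassis + cpus + memory + controllers + drive_price * drive_count
    return total / (drive_count * drive_tb * usable_fraction)

# Hypothetical dense node: head unit + external JBODs, top-bin CPUs, dense memory.
dense = cost_per_usable_tb(chassis=9000, cpus=4000, memory=2500,
                           controllers=1500, drive_price=150,
                           drive_count=90, drive_tb=3.0)

# Hypothetical simpler node: integrated 36-drive chassis, lower-bin CPUs.
simple = cost_per_usable_tb(chassis=5000, cpus=1500, memory=1000,
                            controllers=0, drive_price=150,
                            drive_count=36, drive_tb=3.0)

print(f"dense:  ~${dense:.0f}/usable TB")
print(f"simple: ~${simple:.0f}/usable TB")
```

Once the denser box stops winning on that metric, the extra failure-domain risk buys you nothing.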
Darren