On Wed, Nov 6, 2013 at 4:15 PM, Darren Birkett <darren.birkett@xxxxxxxxx> wrote:
> Hi,
>
> I understand from various reading and research that there are a number of
> things to consider when deciding how many disks one wants to put into a
> single chassis:
>
> 1. Higher density means a higher failure domain (more data to re-replicate if
> you lose a node)
> 2. More disks means more CPU/memory horsepower to handle the number of OSDs
> 3. Network becomes a bottleneck with too many OSDs per node
> 4. ...
>
> We are looking at building high density nodes for small scale 'starter'
> deployments for our customers (maybe 4 or 5 nodes). High density in this
> case could mean a 2u chassis with 2x external 45 disk JBOD containers
> attached. That's 90 3TB disks/OSDs to be managed by a single node. That's
> about 243TB of potential usable space, and so (assuming up to 75% fillage)
> maybe 182TB of potential data 'loss' in the event of a node failure. On an
> uncongested, unused, 10Gbps network, my back-of-a-beer-mat calculations say
> that would take about 45 hours to get the cluster back into an undegraded
> state - that is, the requisite number of copies of all objects.

With that many disks, you should also consider that the controller cache
will buy you very little, even with 1GB controller(s); only a tiered cache
is a real option at that scale. Recovery will also take much longer than
the raw numbers suggest, even if you leave headroom for client I/O in your
calculations, because raw disks have very limited IOPS capacity: recovery
will either take far longer than a first-glance estimate or it will affect
regular operations (rough numbers are sketched at the end of this mail).
For S3/Swift that may be acceptable, but for VM images it is not.

> Assuming that you can shove in a pair of hex core hyperthreaded processors,
> you're probably OK with number 2. If you're already considering 10GbE
> networking for the storage network, there's probably not much you can do
> about 3 unless you want to spend a lot more money (and the reason we're
> going so dense is to keep this as a cheap option). So the main thing would
> seem to be a real fear of 'losing' so much data in the event of a node
> failure. Who wants to wait 45 hours (probably much longer assuming the
> cluster remains live and has production traffic traversing that network) for
> the cluster to self-heal?
>
> But surely this fear is based on an assumption that in that time, you've not
> identified and replaced the failed chassis. That you would sit for 2-3 days
> and just leave the cluster to catch up, and not actually address the broken
> node. Given good data centre processes and a good stock of spare parts, isn't
> it more likely that you'd have replaced that node and got things back up and
> running in a matter of hours? In all likelihood, a node crash/failure is not
> likely to have taken out all, or maybe any, of the disks, and a new chassis
> can just have the JBODs plugged back in and away you go?
>
> I'm sure I'm missing some other pieces, but if you're comfortable with your
> hardware replacement processes, doesn't number 1 become a non-fear really? I
> understand that in some ways it goes against the concept of ceph being self
> healing, and that in an ideal world you'd have lots of lower density nodes
> to limit your failure domain, but when being driven by cost isn't this an OK
> way to look at things? What other glaringly obvious considerations am I
> missing with this approach?
>
> Darren
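For reference, here is roughly how the back-of-a-beer-mat numbers above work
out, as a small Python sketch. The 90 x 3TB, ~243TB usable and 75% fill
figures are taken from the mail; the 0.9 raw-to-usable factor and the
assumption that re-replication is limited only by a single uncontended
10Gbps link are my own simplifications, and real recovery will be slower
once client traffic and the per-disk IOPS limits mentioned above come into
play.

# Back-of-the-envelope recovery estimate for a full-node failure.
# Assumptions (not measurements): 90 x 3TB OSDs in one node, ~75% full,
# re-replication limited only by a single uncontended 10Gbps link.

raw_tb = 90 * 3                        # 270 TB raw per node
usable_tb = raw_tb * 0.9               # ~243 TB usable (the figure quoted above)
data_tb = usable_tb * 0.75             # ~182 TB of actual data at 75% fill

link_gbps = 10.0                       # dedicated 10GbE storage network
link_tb_per_hour = link_gbps / 8 / 1000 * 3600   # ~4.5 TB/h at line rate

hours = data_tb / link_tb_per_hour
print("~%.0f hours to re-replicate %.0f TB" % (hours, data_tb))
# ~40 hours at pure line rate; allowing for protocol and replication
# overhead gets you close to the ~45 hour beer-mat figure, and contention
# with client I/O or per-disk IOPS limits pushes it out much further.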