I'm still a noob too, so don't take anything I say with much weight. I was
hoping that somebody with more experience would reply. I see a few potential
problems.

With that CPU to disk ratio, you're going to need to slow recovery down a lot
to make sure you have enough CPU available after a node reboots. You may need
to tune it down even further in the event that a node fails. I haven't tested
a CPU starvation situation, but I suspect that bad things would happen. You
might get stuck with OSDs not responding fast enough, so they get marked down,
which triggers a recovery, which uses more CPU, etc. I'm not even sure how
you'd get out of that situation once it started.

Regarding I/O, your writes being sequential won't matter: by putting the
journals on the HDDs, all I/O becomes random I/O. You do have a lot of
spindles, though. Doing some quick estimates in my head, I figure you
realistically have about 200 MBps of I/O per node. That seems pretty low
compared to the combined sequential write speed of 3.6 GBps, but remember that
every write to an OSD is really two writes (data plus journal), and they're
random: roughly 10 MBps of random I/O per disk, divided by the 2 writes,
becomes 5 MBps per disk. Add the latency of sending the data over the network
to 1 or 2 other disks that have the same constraints. With replication = 2,
that's 100 MBps per node, which ends up being (best case) about 800 Mbps of
RadosGW writes. Hotspotting and uneven distribution across the nodes will
lower that number. If 1 Gbps of writes per node is a hard requirement, I think
you're going to be disappointed; if your application requirements are lower,
you should be ok.

Regarding latency, it's hard to get specific. Just remember that your data is
being striped across many disks, so the latency of one RadosGW operation is
somewhere between the maximum latency of the OSDs involved and the sum of
their latencies. Like I said, hard to be specific. To begin with, latency will
just increase as the load increases, but at a certain point problems start:
OSDs block because another OSD won't write its data, your RadosGW load
balancer marks RadosGW nodes down because they're unresponsive, OSDs kick
other OSDs out because they're too slow. Most of my Ceph headaches involve too
much latency.

Overall, I think you'll be ok, unless you absolutely have to have that 1 Gbps
write speed per node. Even so, you'll need to prove it. You really want to
load up the cluster with a realistic amount of data, then simulate a failure
and recovery under normal load. Shut a node down for a day, then bring it back
up. Stuff like that. A real production failure will stress things differently
than `ceph bench` does. I made the mistake of testing without enough data:
things worked great when the cluster was 5% used, but had problems when it was
60% used.
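For what it's worth, the settings you mention below (max_backfills,
recovery_max_active, recovery_op_priority) are exactly the knobs I mean by
"slowing recovery down". A rough sketch of what that might look like -- the
values are only illustrative, not recommendations, and you'd want to find your
own sweet spot under real load:

    # ceph.conf, [osd] section: throttle backfill/recovery so client I/O
    # and the CPUs survive a node coming back
    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1

    # or adjust on a running cluster at runtime:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'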
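And a minimal sketch of the kind of failure drill I mean (the pool name and
duration are made up; adjust for your setup):

    # keep a realistic client load running against the cluster during the test
    rados bench -p testpool 3600 write --no-cleanup

    # optionally set noout if you only want to test the node coming back;
    # leave it unset if you want to watch a full backfill happen
    ceph osd set noout

    # now stop the OSD daemons on one node (however your init system does it),
    # leave it down for a while, bring it back, and watch recovery:
    ceph -w

    # clean up afterwards
    ceph osd unset noout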
On 5/9/14 04:18, Cédric Lemarchand wrote:
> Another thought: I would hope that with EC, the data chunks would be spread
> so as to profit from the write capability of each drive they are stored on.
>
> I did not get any reply for now! Does this kind of configuration (hardware
> & software) look crazy?! Am I missing something?
>
> Looking forward to your comments, thanks in advance.
>
> --
> Cédric Lemarchand
>
> On 7 May 2014, at 22:10, Cedric Lemarchand <cedric at yipikai.org
> <mailto:cedric at yipikai.org>> wrote:
>
>> Some more details: the IO pattern will be around 90% write / 10% read,
>> mainly sequential.
>> Recent posts show that the max_backfills, recovery_max_active and
>> recovery_op_priority settings will be helpful in case of
>> backfilling/rebalancing.
>>
>> Any thoughts on such a hardware setup?
>>
>> On 07/05/2014 11:43, Cedric Lemarchand wrote:
>>> Hello,
>>>
>>> This build is only intended for archiving purposes; what matters here
>>> is lowering the $/TB/W ratio.
>>> Access to the storage would be via radosgw, installed on each node.
>>> I need each node to sustain an average 1Gb/s write rate, which
>>> I think should not be a problem. Erasure coding will be
>>> used with something like k=12 m=3.
>>>
>>> A typical node would be:
>>>
>>> - Supermicro 36-bay chassis
>>> - 2x Xeon E5-2630Lv2
>>> - 96GB RAM (the recommended 1GB/TB ratio for OSDs is lowered a bit ...)
>>> - LSI HBA adapters in JBOD mode, could be 2x 9207-8i
>>> - 36x 4TB HDD with the default journal config
>>> - dedicated bonded 2Gb links for the public/private networks
>>> (backfilling will take ages if a full node is lost ...)
>>>
>>> I think that in an *optimal* state (ceph healthy), it could handle the
>>> job. Waiting for your comments.
>>>
>>> What bothers me more is OSD maintenance operations
>>> like backfilling and cluster rebalancing, where nodes will be put
>>> under very high IO/memory/CPU load for hours or days. Will the
>>> latency *just* grow, or will everything fall apart?
>>> (OOM killer spawning, OSDs committing suicide because of latency, nodes
>>> pushed out of the cluster, etc ...)
>>>
>>> As you can tell, I am trying to design the cluster with a sweet spot
>>> in mind like "things become slow, latency grows, but the nodes
>>> stay stable/usable and aren't pushed out of the cluster".
>>>
>>> This is my first jump into Ceph, so any input will be greatly
>>> appreciated ;-)
>>>
>>> Cheers,
>>>
>>> --
>>> Cédric

--
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com
Central Desktop