Bulk storage use case

Hi Craig,

Thanks, I really appreciate the detailed response.

I carefully note your advice, specifically about the CPU starvation scenario, which, as you said, sounds scary.

About I/O: the data will be very resilient. In case of a crash, losing objects that were not fully written will not be a problem (they will simply be re-uploaded later), so I think that in this specific case disabling journaling could be a way to improve I/O.
How would Ceph handle that? Are there caveats other than losing the objects that were in the data path when the crash occurred? I know it may sound weird, but the clients' workflow could tolerate such a thing.
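
For reference, the journal-related knobs I was planning to experiment with anyway look like this (just a sketch, values untested on my side):

    [osd]
        # journal size in MB per OSD
        osd journal size = 10240
        # sync the filestore (and trim the journal) less often than the 5s default
        filestore max sync interval = 30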

Thanks!

--
Cédric Lemarchand

> On 10 May 2014, at 04:30, Craig Lewis <clewis at centraldesktop.com> wrote:
> 
> I'm still a noob too, so don't take anything I say with much weight.  I was hoping that somebody with more experience would reply.
> 
> 
> I see a few potential problems.
> 
> With that CPU to disk ratio, you're going to need to slow recovery down a lot to make sure you have enough CPU available after a node reboots.  You may need to tune it down even further in the event that a node fails.  I haven't tested a CPU starvation situation, but I suspect that bad things would happen.  You might get stuck with OSDs not responding fast enough, so they get marked down, which triggers a recovery, which uses more CPU, etc.  I'm not even sure how you'd get out of that situation if it started.
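> 
> The knob I'd reach for first in that situation (from memory, untested in a real starvation spiral) is throttling recovery on the fly:
> 
>     ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'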
> 
> 
> Regarding I/O, your writes being sequential won't matter.  By using journals on the HDDs, all I/O becomes random I/O.  You have a lot of spindles though.  Doing a quick estimate in my head, I figure that you realistically have 200 MBps of I/O per node.  That seems pretty low compared to the combined sequential write speed of 3.6 GBps.  Just remember that every write to an OSD is really two writes (journal plus data), which means you're doing random I/O.  10 MBps per disk, divided by the 2 writes, becomes 5 MBps per disk.  Plus the latency of sending the data over the network to 1 or 2 other disks that have the same constraints.
> 
> With replication = 2, that's 100 MBps per node.  That ends up being (best case) about 800 Mbps of RadosGW writes. Hotspotting and uneven distribution on the nodes will lower that number.  If 1 Gbps writes per node are a hard requirement, I think you're going to be disappointed.  If your application requirements are lower, then you should be ok.
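> 
> To put the whole chain in one line: 36 disks x 10 MBps of random I/O = 360 MBps, halved by the journal double-write to ~180 MBps (call it 200 per node), halved again by replication = 2 to ~100 MBps, which is roughly 800 Mbps of client writes.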
> 
> 
> Regarding latency, it's hard to get specific.  Just remember that your data is being striped across many disks, so the latency of one RadosGW operation is somewhere between the max latency of the OSDs involved and the sum of their latencies.  Like I said, hard to be specific.  To begin with, latency will just increase as the load increases.  But at a certain point, problems will start.  OSDs will block because another OSD won't write its data.  Your RadosGW load balancer might mark RadosGW nodes down because they're unresponsive.  OSDs might kick other OSDs out because they're too slow.  Most of my Ceph headaches involve too much latency.
> 
> 
> Overall, I think you'll be ok, unless you absolutely have to have that 1 Gbps write speed per node.  Even so, you'll need to prove it.  You really want to load up the cluster with a real amount of data, then simulate a failure and recovery under normal load.  Shut a node down for a day, then bring it back up.  Stuff like that.  A real production failure will stress things differently than `ceph bench` does.  I made the mistake of testing without enough data.  Things worked great when the cluster was 5% used, but had problems when the cluster was 60% used.
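> 
> One way to run that test (just a sketch): load the cluster to a realistic fill level with real RadosGW uploads or something like `rados bench -p <somepool> 3600 write --no-cleanup`, then stop the OSDs on one node (or just power it off) and watch `ceph -w` and client latency while recovery runs under normal load.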
> 
> 
> 
> 
>> On 5/9/14 04:18, Cédric Lemarchand wrote:
>> Another thought: I would hope that with EC, the spreading of data chunks would benefit from the write capability of each drive where they are stored.
>> 
>> I have not received any reply so far! Does this kind of configuration (hardware & software) look crazy?! Am I missing something?
>> 
>> Looking forward to your comments, thanks in advance.
>> 
>> --
>> Cédric Lemarchand
>> 
>> On 7 May 2014, at 22:10, Cedric Lemarchand <cedric at yipikai.org> wrote:
>> 
>>> Some more details: the I/O pattern will be around 90% write / 10% read, mainly sequential.
>>> Recent posts show that the max_backfills, recovery_max_active and recovery_op_priority settings will be helpful in case of backfilling/rebalancing.
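>>> 
>>> Something like this in ceph.conf is what I have in mind (values are just a first guess, to be tuned):
>>> 
>>>     [osd]
>>>         osd max backfills = 1
>>>         osd recovery max active = 1
>>>         osd recovery op priority = 1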
>>> 
>>> Any thoughts on such a hardware setup?
>>> 
>>> On 07/05/2014 11:43, Cedric Lemarchand wrote:
>>>> Hello,
>>>> 
>>>> This build is only intended for archiving purposes; what matters here is lowering the $/TB/W ratio.
>>>> Access to the storage would be via radosgw, installed on each node. I need each node to sustain an average 1 Gb/s write rate, which I think should not be a problem. Erasure coding will be used with something like k=12, m=3.
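>>>> 
>>>> For the erasure-coded pool I expect the setup to look roughly like this (a sketch; profile name, pool name and pg count are placeholders to be sized properly):
>>>> 
>>>>     ceph osd erasure-code-profile set archive k=12 m=3
>>>>     ceph osd pool create ecpool 4096 4096 erasure archive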
>>>> 
>>>> A typical node would be:
>>>> 
>>>> - Supermicro 36 bays
>>>> - 2x Xeon E5-2630Lv2
>>>> - 96 GB of RAM (the recommended 1 GB/TB ratio per OSD is lowered a bit ...)
>>>> - LSI HBA adapters in JBOD mode, could be 2x 9207-8i
>>>> - 36x 4 TB HDDs with the default journal config
>>>> - dedicated bonded 2 Gb links for the public/private networks (backfilling will take ages if a full node is lost ...)
>>>> 
>>>> 
>>>> I think that in an *optimal* state (Ceph healthy), it could handle the job. Waiting for your comments.
>>>> 
>>>> What bothers me more is OSD maintenance operations like backfilling and cluster rebalancing, where nodes will be put under very high I/O, memory and CPU load for hours or days. Will latency *just* grow, or will everything fly apart? (OOM killer triggered, OSDs committing suicide because of latency, nodes pushed out of the cluster, etc ...)
>>>> 
>>>> As you can see, I am trying to design the cluster with a sweet spot in mind, along the lines of "things become slow, latency grows, but the nodes stay stable/usable and aren't pushed out of the cluster".
>>>> 
>>>> This is my first jump into Ceph, so any input will be greatly appreciated ;-)
>>>> 
>>>> Cheers,
>>>> 
>>>> --
>>>> Cédric
>>>> 
>>> 
>>> -- 
>>> Cédric
>> 
>> 
> 
> 
> -- 
> Craig Lewis 
> Senior Systems Engineer
> Office +1.714.602.1309
> Email clewis at centraldesktop.com
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com