Hi Nick,

On 06/12/2014 17:15, Nick Fisk wrote:
> Hi Loic,
>
> A very interesting reply and your description of the promotion behaviour
> makes perfect sense. I can see how a larger number of data chunks could
> impact latency, so would certainly impact an OLTP type workload where low
> latency is critical.
>
> Would you know if the "promotion/EC pool read" step that you described is
> blocking? For example if you had a queue depth higher than 1, would OSDs in
> an EC pool process the promotion requests in parallel and thus take
> advantage of queue reordering on the disks? Or will all OSDs wait for the
> current IO to be read from all data chunks and then process the next IO?

In the case of an RBD volume, which is made of multiple objects, the
promotion / demotion of each object is independent of the others. The
write / read ordering and locking live in the RBD layer, and I don't see
how the RADOS / tiering / erasure code logic could interfere with them.
Although I'm not familiar with RBD internals, I would be surprised if
promotion / demotion of objects did not happen in parallel.

> If it does work in parallel, I can see that in batch type workloads the
> increase in latency shouldn't be as much of a problem, as splitting the IOs
> over as many disks as possible, whilst increasing latency, will also
> increase total throughput if queue re-ordering is working. I would imagine
> the number of data chunks could increase until the data chunk size starts
> approaching the IO size, or CPU overhead starts to have an impact.

That makes sense to me.

> I suppose the same is also true for sequential workloads, where more OSDs
> would mean the data is spread over smaller blocks, thus decreasing the
> service time of each IO on each disk and increasing total bandwidth?
>
> Once our cluster is operational I will test some of these theories and post
> the results.

Cool :-)

Cheers

> Many Thanks for your help
> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Loic Dachary
> Sent: 05 December 2014 17:28
> To: Nick Fisk; 'Ceph Users'
> Subject: Re: Erasure Encoding Chunks
>
> On 05/12/2014 17:41, Nick Fisk wrote:
>> Hi Loic,
>>
>> Thanks for your response.
>>
>> The idea for this cluster will be for our VM replica storage in our
>> secondary site. Initially we are planning to have a 40 disk EC pool
>> sitting behind a cache pool of around 1TB post replica size.
>>
>> This storage will be presented as RBDs and then exported as an HA
>> iSCSI target to ESX hosts. The VMs will be replicated from our
>> primary site via a software product called Veeam.
>>
>> I'm hoping that the 1TB cache layer should be big enough to hold most
>> of the hot data, meaning that the EC pool shouldn't see a large amount
>> of IO, just the trickle of the cache layer flushing back to disk. We
>> can switch back to a 3-way replica pool if the EC pool doesn't work
>> out for us, but we are interested in testing out the EC technology.
>>
>> I hope that provides an insight into what I am trying to achieve.
>
> When the erasure coded object has to be promoted back to the replicated
> pool, you want that to happen as fast as possible. The read will return when
> all 6 OSDs give their data chunk to the primary OSD (holding the 7th chunk).
> The 6 reads happen in parallel and complete only when the slowest OSD
> returns. If you have 16 OSDs instead of 6, you increase the odds of slowing
> the whole read down because one of them is significantly slower than the
> others.
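To make that "slowest chunk wins" effect concrete, here is a rough, purely
illustrative Python sketch (not Ceph code; the log-normal per-chunk latency
distribution and its parameters are just assumptions) that models a promotion
read as the maximum of k parallel chunk reads and compares k=6 with k=17:

# Illustrative only: a toy Monte Carlo model of an erasure coded promotion
# read that completes only when the slowest of the k data chunks has been
# returned to the primary OSD. Per-chunk latencies are drawn from an assumed
# log-normal distribution; real OSD latencies will differ, the point is only
# the "max of k parallel reads" effect.
import random

def mean_promotion_read_ms(k, trials=100000):
    total = 0.0
    for _ in range(trials):
        # one latency sample per data chunk, all chunks read in parallel
        chunk_latencies = [random.lognormvariate(2.0, 0.5) for _ in range(k)]
        total += max(chunk_latencies)  # done when the slowest chunk arrives
    return total / trials

for k in (6, 17):
    print("k=%2d data chunks: mean promotion read ~%.1f ms"
          % (k, mean_promotion_read_ms(k)))

With these made-up numbers the k=17 reads come out noticeably slower on
average, and a single consistently slow OSD would widen the gap further,
which is what the monitoring remark below is about.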
> If you have 40 OSDs, you probably don't need a sophisticated monitoring
> system detecting hard drive misbehavior, and a slow disk could go unnoticed
> and degrade your performance significantly, because more than a third of the
> objects use it (each object is using 20 OSDs total, 17 of which are for data
> you need to promote to the replicated pool). If you had over 1000 OSDs, you
> would probably need to monitor the hard drives accurately, detect slow
> OSDs sooner and move them out of the cluster. And only a fraction of the
> objects would be impacted by a slow OSD.
>
> I would love to hear what an architect would advise.
>
> Cheers
>
>> Thanks,
>> Nick
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> Of Loic Dachary
>> Sent: 05 December 2014 16:23
>> To: Nick Fisk; 'Ceph Users'
>> Subject: Re: Erasure Encoding Chunks
>>
>> On 05/12/2014 16:21, Nick Fisk wrote:
>>> Hi All,
>>>
>>> Does anybody have any input on what the best ratio + total number of
>>> data + coding chunks you would choose?
>>>
>>> For example I could create a pool with 7 data chunks and 3 coding chunks
>>> and get an efficiency of 70%, or I could create a pool with 17 data
>>> chunks and 3 coding chunks and get an efficiency of 85% with a similar
>>> probability of protecting against OSD failure.
>>>
>>> What's the reason I would choose 10 total chunks over 20 chunks? Is it
>>> purely down to the overhead of having potentially double the number of
>>> chunks per object?
>>
>> Hi Nick,
>>
>> Assuming you have a large number of OSDs (a thousand or more) with cold
>> data, 20 is probably better. When you try to read the data it involves 20
>> OSDs instead of 10, but you probably don't care if reads are rare.
>>
>> Disclaimer: I'm a developer, not an architect ;-) It would help to know
>> the target use case, the size of the data set and the expected read/write
>> rate.
>>
>> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre