Hi Loic,

A very interesting reply, and your description of the promotion behaviour makes perfect sense. I can see how a larger number of data chunks could increase latency, which would certainly hurt an OLTP-type workload where low latency is critical.

Would you know if the "promotion/EC pool read" step you described is blocking? For example, with a queue depth higher than 1, would the OSDs in an EC pool process the promotion requests in parallel and thus take advantage of queue reordering on the disks? Or will all the OSDs wait for the current IO to be read from every data chunk before processing the next IO?

If it does work in parallel, I can see that the increase in latency shouldn't be as much of a problem for batch-type workloads: splitting the IOs over as many disks as possible increases latency, but it should also increase total throughput if queue reordering is working. I would imagine the number of data chunks could keep growing until the chunk size starts approaching the IO size, or until CPU overhead starts to have an impact. I suppose the same is also true for sequential workloads, where more OSDs would mean the data is spread over smaller blocks, decreasing the service time of each IO on each disk and increasing total bandwidth?

Once our cluster is operational I will test some of these theories and post the results.

Many thanks for your help,
Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Loic Dachary
Sent: 05 December 2014 17:28
To: Nick Fisk; 'Ceph Users'
Subject: Re: Erasure Encoding Chunks

On 05/12/2014 17:41, Nick Fisk wrote:
> Hi Loic,
>
> Thanks for your response.
>
> The idea for this cluster will be for our VM replica storage in our
> secondary site. Initially we are planning to have a 40-disk EC pool
> sitting behind a cache pool of around 1TB post-replica size.
>
> This storage will be presented as RBDs and then exported as an HA
> iSCSI target to ESX hosts. The VMs will be replicated from our
> primary site via a software product called Veeam.
>
> I'm hoping that the 1TB cache layer should be big enough to hold most
> of the hot data, meaning that the EC pool shouldn't see a large amount
> of IO, just the trickle of the cache layer flushing back to disk. We
> can switch back to a 3-way replica pool if the EC pool doesn't work
> out for us, but we are interested in testing out the EC technology.
>
> I hope that provides an insight into what I am trying to achieve.

When the erasure-coded object has to be promoted back to the replicated pool, you want that to happen as fast as possible. The read will return when all 6 OSDs have given their data chunk to the primary OSD (which holds the 7th chunk). The 6 reads happen in parallel and complete when the slowest OSD returns. If you have 16 OSDs instead of 6, you increase the odds of the whole read being slowed down because one of them is significantly slower than the others.

If you have 40 OSDs, you probably don't need a sophisticated monitoring system that detects hard drive misbehaviour, so a slow disk could go unnoticed and degrade your performance significantly, because more than a third of the objects use it (each object uses 20 OSDs in total, 17 of which hold data you need to promote to the replicated pool). If you had over 1000 OSDs, you would probably monitor the hard drives accurately, detect slow OSDs sooner and move them out of the cluster, and only a fraction of the objects would be impacted by a slow OSD.

I would love to hear what an architect would advise.
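If it helps to see the effect, here is a toy sketch of the "wait for the slowest of k chunks" behaviour. It is plain Python, not Ceph code; the lognormal per-chunk latency distribution and the function name are made up, so only the trend matters, not the numbers.

import random
import statistics

def promotion_read_latency(k, trials=10000):
    """Mean latency of a read that must wait for all k data chunks."""
    # Per-chunk service times are drawn from an arbitrary lognormal
    # distribution; the read finishes when the slowest chunk arrives.
    return statistics.mean(
        max(random.lognormvariate(0, 0.5) for _ in range(k))
        for _ in range(trials)
    )

for k in (6, 10, 17):
    print(f"k={k:2d}: mean promotion read latency ~ {promotion_read_latency(k):.2f} (arbitrary units)")

The absolute values are meaningless; the point is simply that the mean of the maximum keeps growing as k grows, which is the tail-latency penalty described above.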
Cheers

>
> Thanks,
> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of Loic Dachary
> Sent: 05 December 2014 16:23
> To: Nick Fisk; 'Ceph Users'
> Subject: Re: Erasure Encoding Chunks
>
> On 05/12/2014 16:21, Nick Fisk wrote:
>> Hi All,
>>
>> Does anybody have any input on what the best ratio and total number
>> of data + coding chunks would be?
>>
>> For example, I could create a pool with 7 data chunks and 3 coding
>> chunks and get an efficiency of 70%, or I could create a pool with 17
>> data chunks and 3 coding chunks and get an efficiency of 85% with a
>> similar probability of protecting against OSD failure.
>>
>> What's the reason I would choose 10 total chunks over 20 chunks? Is it
>> purely down to the overhead of having potentially double the number of
>> chunks per object?
>
> Hi Nick,
>
> Assuming you have a large number of OSDs (a thousand or more) with
> cold data, 20 is probably better. When you try to read the data it
> involves 20 OSDs instead of 10, but you probably don't care if reads
> are rare.
>
> Disclaimer: I'm a developer, not an architect ;-) It would help to
> know the target use case, the size of the data set and the expected
> read/write rate.
>
> Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
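P.S. For anyone else weighing this up, the efficiency figures in the thread above are just k / (k + m) for a k data + m coding profile. A trivial sanity check in plain Python (nothing Ceph-specific, the function name is just for illustration):

def efficiency(k, m):
    # Usable fraction of raw capacity for a k data + m coding profile.
    return k / (k + m)

for k, m in ((7, 3), (17, 3)):
    print(f"k={k}, m={m}: {efficiency(k, m):.0%} usable")  # 70% and 85%

So 17+3 buys an extra 15 points of usable capacity over 7+3, at the cost of each object touching twice as many OSDs.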