Hi Loic,

A very interesting reply, and your description of the promotion behaviour makes perfect sense. I can see how a larger number of data chunks could increase latency, which would certainly hurt an OLTP-type workload where low latency is critical.

Would you know if the "promotion/EC pool read" step you described is blocking? For example, with a queue depth higher than 1, would the OSDs in an EC pool process the promotion requests in parallel and thus take advantage of queue reordering on the disks? Or will all the OSDs wait for the current IO to be read from every data chunk before processing the next IO?

If it does work in parallel, I can see that the increase in latency shouldn't be as much of a problem for batch-type workloads: splitting the IOs over as many disks as possible increases latency, but it should also increase total throughput if queue reordering is working. I would imagine the number of data chunks could keep growing until the chunk size starts approaching the IO size, or until CPU overhead starts to have an impact. I suppose the same is also true for sequential workloads, where more OSDs would mean the data is spread over smaller blocks, decreasing the service time of each IO on each disk and increasing total bandwidth?

Once our cluster is operational I will test some of these theories and post the results.

Many thanks for your help,
Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Loic Dachary
Sent: 05 December 2014 17:28
To: Nick Fisk; 'Ceph Users'
Subject: Re: Erasure Encoding Chunks

On 05/12/2014 17:41, Nick Fisk wrote:
> Hi Loic,
>
> Thanks for your response.
>
> The idea for this cluster will be for our VM replica storage in our
> secondary site. Initially we are planning to have a 40-disk EC pool
> sitting behind a cache pool of around 1TB post-replica size.
>
> This storage will be presented as RBDs and then exported as an HA
> iSCSI target to ESX hosts. The VMs will be replicated from our
> primary site via a software product called Veeam.
>
> I'm hoping that the 1TB cache layer should be big enough to hold most
> of the hot data, meaning that the EC pool shouldn't see a large amount
> of IO, just the trickle of the cache layer flushing back to disk. We
> can switch back to a 3-way replica pool if the EC pool doesn't work
> out for us, but we are interested in testing out the EC technology.
>
> I hope that provides an insight into what I am trying to achieve.

When the erasure-coded object has to be promoted back to the replicated pool, you want that to happen as fast as possible. The read will return when all 6 OSDs have given their data chunk to the primary OSD (which holds the 7th chunk). The 6 reads happen in parallel and complete when the slowest OSD returns. If you have 16 OSDs instead of 6, you increase the odds of the whole read being slowed down because one of them is significantly slower than the others.

If you have 40 OSDs, you probably don't need a sophisticated monitoring system that detects hard drive misbehaviour, so a slow disk could go unnoticed and degrade your performance significantly, because more than a third of the objects use it (each object uses 20 OSDs in total, 17 of which hold data you need to promote to the replicated pool). If you had over 1000 OSDs, you would probably monitor the hard drives accurately, detect slow OSDs sooner and move them out of the cluster, and only a fraction of the objects would be impacted by a slow OSD.

I would love to hear what an architect would advise.
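If it helps to see the effect, here is a toy sketch of the "wait for the slowest of k chunks" behaviour. It is plain Python, not Ceph code; the lognormal per-chunk latency distribution and the function name are made up, so only the trend matters, not the numbers.

import random
import statistics

def promotion_read_latency(k, trials=10000):
    """Mean latency of a read that must wait for all k data chunks."""
    # Per-chunk service times are drawn from an arbitrary lognormal
    # distribution; the read finishes when the slowest chunk arrives.
    return statistics.mean(
        max(random.lognormvariate(0, 0.5) for _ in range(k))
        for _ in range(trials)
    )

for k in (6, 10, 17):
    print(f"k={k:2d}: mean promotion read latency ~ {promotion_read_latency(k):.2f} (arbitrary units)")

The absolute values are meaningless; the point is simply that the mean of the maximum keeps growing as k grows, which is the tail-latency penalty described above.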
Cheers

>
> Thanks,
> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of Loic Dachary
> Sent: 05 December 2014 16:23
> To: Nick Fisk; 'Ceph Users'
> Subject: Re: Erasure Encoding Chunks
>
> On 05/12/2014 16:21, Nick Fisk wrote:
>> Hi All,
>>
>> Does anybody have any input on what the best ratio and total number
>> of data + coding chunks would be?
>>
>> For example, I could create a pool with 7 data chunks and 3 coding
>> chunks and get an efficiency of 70%, or I could create a pool with 17
>> data chunks and 3 coding chunks and get an efficiency of 85% with a
>> similar probability of protecting against OSD failure.
>>
>> What's the reason I would choose 10 total chunks over 20 chunks? Is it
>> purely down to the overhead of having potentially double the number of
>> chunks per object?
>
> Hi Nick,
>
> Assuming you have a large number of OSDs (a thousand or more) with
> cold data, 20 is probably better. When you try to read the data it
> involves 20 OSDs instead of 10, but you probably don't care if reads
> are rare.
>
> Disclaimer: I'm a developer, not an architect ;-) It would help to
> know the target use case, the size of the data set and the expected
> read/write rate.
>
> Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
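P.S. For anyone else weighing this up, the efficiency figures in the thread above are just k / (k + m) for a k data + m coding profile. A trivial sanity check in plain Python (nothing Ceph-specific, the function name is just for illustration):

def efficiency(k, m):
    # Usable fraction of raw capacity for a k data + m coding profile.
    return k / (k + m)

for k, m in ((7, 3), (17, 3)):
    print(f"k={k}, m={m}: {efficiency(k, m):.0%} usable")  # 70% and 85%

So 17+3 buys an extra 15 points of usable capacity over 7+3, at the cost of each object touching twice as many OSDs.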