Hi Nick,

On 06/12/2014 17:15, Nick Fisk wrote:
> Hi Loic,
>
> A very interesting reply and your description of the promotion behaviour
> makes perfect sense. I can see how a larger number of data chunks could
> impact latency, so would certainly impact an OLTP type workload where low
> latency is critical.
>
> Would you know if the "promotion/EC pool read" step that you described is
> blocking? For example if you had a queue depth higher than 1, would OSDs in
> an EC pool process the promotion requests in parallel and thus take
> advantage of queue reordering on the disks? Or will all OSDs wait for the
> current IO to be read from all data chunks and then process the next IO?

In the case of an RBD volume, which is made of multiple objects, the
promotion / demotion of each object is independent of the others. The
write / read ordering and locking live in the RBD layer, and I don't see
how the RADOS / tiering / erasure code logic could interfere with them.
Although I'm not familiar with RBD internals, I would be surprised if
promotion / demotion of objects did not happen in parallel.

> If it does work in parallel, I can see that in batch type workloads the
> increase in latency shouldn't be as much of a problem, as splitting the IOs
> over as many disks as possible, whilst increasing latency, will also
> increase total throughput if queue re-ordering is working. I would imagine
> the number of data chunks could increase until the data chunk size starts
> approaching the IO size, or CPU overhead starts to have an impact.

That makes sense to me.

> I suppose the same is also true for sequential workloads, where more OSDs
> would mean the data is spread over smaller blocks, thus decreasing the
> service time of each IO on each disk and increasing total bandwidth?
>
> Once our cluster is operational I will test some of these theories and post
> the results.

Cool :-)

Cheers

> Many Thanks for your help
> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Loic Dachary
> Sent: 05 December 2014 17:28
> To: Nick Fisk; 'Ceph Users'
> Subject: Re: Erasure Encoding Chunks
>
> On 05/12/2014 17:41, Nick Fisk wrote:
>> Hi Loic,
>>
>> Thanks for your response.
>>
>> The idea for this cluster will be for our VM replica storage in our
>> secondary site. Initially we are planning to have a 40 disk EC pool
>> sitting behind a cache pool of around 1TB post replica size.
>>
>> This storage will be presented as RBDs and then exported as an HA
>> iSCSI target to ESX hosts. The VMs will be replicated from our
>> primary site via a software product called Veeam.
>>
>> I'm hoping that the 1TB cache layer should be big enough to hold most
>> of the hot data, meaning that the EC pool shouldn't see a large amount
>> of IO, just the trickle of the cache layer flushing back to disk. We
>> can switch back to a 3-way replica pool if the EC pool doesn't work
>> out for us, but we are interested in testing out the EC technology.
>>
>> I hope that provides an insight into what I am trying to achieve.
>
> When the erasure coded object has to be promoted back to the replicated
> pool, you want that to happen as fast as possible. The read will return when
> all 6 OSDs give their data chunk to the primary OSD (holding the 7th chunk).
> The 6 reads happen in parallel and complete only when the slowest OSD
> returns. If you have 16 OSDs instead of 6, you increase the odds of slowing
> the whole read down because one of them is significantly slower than the
> others.
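To make that "slowest chunk wins" effect concrete, here is a rough, purely
illustrative Python sketch (not Ceph code; the log-normal per-chunk latency
distribution and its parameters are just assumptions) that models a promotion
read as the maximum of k parallel chunk reads and compares k=6 with k=17:

# Illustrative only: a toy Monte Carlo model of an erasure coded promotion
# read that completes only when the slowest of the k data chunks has been
# returned to the primary OSD. Per-chunk latencies are drawn from an assumed
# log-normal distribution; real OSD latencies will differ, the point is only
# the "max of k parallel reads" effect.
import random

def mean_promotion_read_ms(k, trials=100000):
    total = 0.0
    for _ in range(trials):
        # one latency sample per data chunk, all chunks read in parallel
        chunk_latencies = [random.lognormvariate(2.0, 0.5) for _ in range(k)]
        total += max(chunk_latencies)  # done when the slowest chunk arrives
    return total / trials

for k in (6, 17):
    print("k=%2d data chunks: mean promotion read ~%.1f ms"
          % (k, mean_promotion_read_ms(k)))

With these made-up numbers the k=17 reads come out noticeably slower on
average, and a single consistently slow OSD would widen the gap further,
which is what the monitoring remark below is about.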
> If you have 40 OSDs, you probably don't need a sophisticated monitoring
> system detecting hard drive misbehavior, and a slow disk could go unnoticed
> and degrade your performance significantly, because more than a third of the
> objects use it (each object is using 20 OSDs total, 17 of which are for data
> you need to promote to the replicated pool). If you had over 1000 OSDs, you
> would probably need to monitor the hard drives accurately, detect slow
> OSDs sooner and move them out of the cluster. And only a fraction of the
> objects would be impacted by a slow OSD.
>
> I would love to hear what an architect would advise.
>
> Cheers
>
>> Thanks,
>> Nick
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> Of Loic Dachary
>> Sent: 05 December 2014 16:23
>> To: Nick Fisk; 'Ceph Users'
>> Subject: Re: Erasure Encoding Chunks
>>
>> On 05/12/2014 16:21, Nick Fisk wrote:
>>> Hi All,
>>>
>>> Does anybody have any input on what the best ratio + total number of
>>> data + coding chunks you would choose?
>>>
>>> For example I could create a pool with 7 data chunks and 3 coding chunks
>>> and get an efficiency of 70%, or I could create a pool with 17 data
>>> chunks and 3 coding chunks and get an efficiency of 85% with a similar
>>> probability of protecting against OSD failure.
>>>
>>> What's the reason I would choose 10 total chunks over 20 chunks? Is it
>>> purely down to the overhead of having potentially double the number of
>>> chunks per object?
>>
>> Hi Nick,
>>
>> Assuming you have a large number of OSDs (a thousand or more) with cold
>> data, 20 is probably better. When you try to read the data it involves 20
>> OSDs instead of 10, but you probably don't care if reads are rare.
>>
>> Disclaimer: I'm a developer, not an architect ;-) It would help to know
>> the target use case, the size of the data set and the expected read/write
>> rate.
>>
>> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre