FWIW, I'm beginning to think that SSD journals are a requirement. Even with minimal recovery/backfilling settings, it's very easy to kick off an operation that will bring a cluster to its knees: increasing PG/PGP, increasing replication, adding too many new OSDs, etc. These operations can cause latency to increase 50x. The SSDs won't completely hide it, but they've brought latency down to "painful but tolerable".

As Kostis suggests, the only option I've found so far is to do smaller operations and hope I made them small enough. Try to do something that will affect less than 10% of your OSDs, i.e., instead of adding 10% new OSDs in one operation, add one per node, and wait until the recovery finishes. It takes a lot longer and moves the data many times, but my latency generally only doubles instead of increasing 50x.

I've figured out how to do that for OSD additions and PG/PGP increases. I haven't figured out a way to do it for replication levels. If I want to change a replication level, I think it will be better to create new pools and migrate the data manually.

On Wed, Jul 9, 2014 at 6:59 AM, Gregory Farnum <greg at inktank.com> wrote:
> You're physically moving (lots of) data around between most of your
> disks. There's going to be an IO impact from that, although we are
> always working on ways to make it more controllable and try to
> minimize its impact. Your average latency increase sounds a little
> high to me, but I don't have much data to draw from; maybe others who
> have done this on large clusters can discuss.
>
> Basically, think of what happens to IO performance on a resilvering
> RAID array. We should be a lot better than that, but it's the same
> concept.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> On Tue, Jul 8, 2014 at 11:15 PM, Kostis Fardelas <dante1234 at gmail.com>
> wrote:
> > Hi Greg,
> > thanks for your immediate feedback. My comments follow.
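[A sketch of the "one OSD per node, then wait for recovery" approach described at the top of this message. Hostnames and the device path are placeholders, not from the thread; the actual addition commands are commented out since they require a live cluster.]

```shell
# Poll cluster health until backfill/recovery has finished.
# `ceph health` prints HEALTH_OK once the cluster has settled.
wait_for_recovery() {
    until ceph health 2>/dev/null | grep -q HEALTH_OK; do
        sleep 60
    done
}

# Hypothetical usage: add exactly one OSD per node, letting recovery
# finish before touching the next node.
# for host in ceph-node1 ceph-node2 ceph-node3; do
#     ceph-deploy osd create "$host":/dev/sdX   # placeholder device
#     wait_for_recovery
# done
```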
> >
> > Initially we thought that the 248 PG (15%) increment we used was
> > really small, but it seems that we should increase PGs in even
> > smaller increments. I think that the term "multiples" is not the
> > appropriate term here; I fear someone would assume that it is the
> > same (or even the right way to do it) to go from 10 PGs to 20 PGs
> > and from 1000 PGs to 2000 PGs just because he/she uses a small 2X
> > multiple.
> >
> > Regarding the data movement due to the pgp_num increase, we had
> > already set osd_max_backfills, osd_recovery_max_active,
> > osd_recovery_op_priority, and osd_recovery_threads to their minimum
> > values, but we still got impacted. The first two are also set in
> > ceph.conf, but we usually change all four of them at runtime
> > (through injection). Is there anything else we should check? Is
> > this a known issue?
> >
> > Another question that came up from our exercise is related to pool
> > isolation during PG remapping. As I reported, we only changed the
> > pg/pgp num in one of our pools, but ceph client io and ceph ops
> > seem to have dropped at the cluster level (verified by looking at
> > ceph status). Did our second pool get impacted too, or should we
> > take it for granted that the pools are indeed isolated during
> > remapping and there is a ceph status view granularity issue here?
> >
> > Regards,
> > Kostis
> >
> > On 8 July 2014 20:01, Gregory Farnum <greg at inktank.com> wrote:
> >> The impact won't be 300 times bigger, but it will be bigger. There
> >> are two things impacting your cluster here:
> >> 1) the initial "split" of the affected PGs into multiple child
> >> PGs. You can mitigate this by stepping through pg_num at small
> >> multiples.
> >> 2) the movement of data to its new location (when you adjust
> >> pgp_num). This can be adjusted by setting the "OSD max backfills"
> >> and related parameters; check the docs.
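[The runtime injection Kostis describes above would look roughly like this; the four option names are from the thread, and quoting `osd.*` avoids shell globbing. This is a sketch of the technique, not a tested command line for any particular Ceph release.]

```shell
# Throttle recovery/backfill at runtime on all OSDs, without a
# restart, by injecting the four options mentioned in the thread
# at their minimum values:
ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1 --osd-recovery-threads 1'
```

Settings injected this way are not persistent; the thread notes that the first two are also set in ceph.conf so they survive restarts.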
> >> -Greg
> >>
> >>
> >> On Tuesday, July 8, 2014, Kostis Fardelas <dante1234 at gmail.com> wrote:
> >>>
> >>> Hi,
> >>> we maintain a cluster with 126 OSDs, replication 3 and appr. 148T
> >>> raw used space. We store data objects basically on two pools, one
> >>> being appr. 300x larger than the other in terms of data stored
> >>> and # of objects. Based on the formula provided here
> >>> http://ceph.com/docs/master/rados/operations/placement-groups/ we
> >>> computed that we need to increase our per-pool pg_num & pgp_num
> >>> to appr. 6300 PGs / pool (100 * 126 / 2).
> >>> We started by increasing the pg & pgp number on the smaller pool
> >>> from 1800 to 2048 PGs (first the pg_num, then the pgp_num) and we
> >>> experienced a 10X increase in Ceph total operations and an appr.
> >>> 3X disk latency increase on some underlying OSD disks. At the
> >>> same time, for appr. 10 seconds we experienced very low values of
> >>> client io and op/s.
> >>>
> >>> Should we be worried that the pg/pgp num increase on the bigger
> >>> pool will have a 300X larger impact?
> >>> Can we throttle this impact by injecting any thresholds or
> >>> applying an appropriate configuration in our ceph conf?
> >>>
> >>> Regards,
> >>> Kostis
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users at lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
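[The arithmetic behind the 6300-PG target quoted above, together with the small-step increase Greg recommends, can be sketched as follows. The pool name `data` is a placeholder; the ceph commands are commented out since they require a live cluster.]

```shell
# Reproduce the thread's calculation: (100 * num_OSDs) / num_pools.
OSDS=126
POOLS=2
TARGET=$(( 100 * OSDS / POOLS ))
echo "target PGs per pool: $TARGET"    # prints 6300

# Hypothetical incremental step on a pool named "data": raise pg_num
# first (triggers PG splitting), let the cluster settle, then raise
# pgp_num (triggers the data movement), as Greg suggests.
#   ceph osd pool set data pg_num 2048
#   ceph osd pool set data pgp_num 2048
```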