You're physically moving (lots of) data around between most of your
disks. There's going to be an IO impact from that, although we are
always working on ways to make it more controllable and to minimize
its impact. Your average latency increase sounds a little high to me,
but I don't have much data to draw from; maybe others who have done
this on large clusters can discuss. Basically, think of what happens
to IO performance on a resilvering RAID array. We should be a lot
better than that, but it's the same concept.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Jul 8, 2014 at 11:15 PM, Kostis Fardelas <dante1234 at gmail.com> wrote:
> Hi Greg,
> thanks for your immediate feedback. My comments follow.
>
> Initially we thought that the 248 PG (15%) increment we used was
> really small, but it seems that we should increase PGs in even smaller
> increments. I think that "multiples" is not the appropriate term
> here; I fear someone would assume that it is the same (or even the
> right thing to do) to go from 10 PGs to 20 PGs and from 1000 PGs to
> 2000 PGs just because he/she uses a small 2X multiple.
>
> Regarding the data movement due to the pgp_num increase, we had already
> set osd_max_backfills, osd_recovery_max_active,
> osd_recovery_op_priority, osd_recovery_threads to their minimum values
> but we still got impacted. The first two are also set in ceph.conf but
> we change all four of them at runtime (through injecting). Is
> there anything else we should check? Is it some known issue?
>
> Another question that came up from our exercise is related to pool
> isolation during PG remapping. As I reported, we only changed the
> pg/pgp num in one of our pools but ceph client io and ceph ops seem to
> have dropped at cluster level (verified by looking at ceph status).
> Did our second pool get impacted too, or should we take for granted
> that the pools are indeed isolated during remapping and there is a
> ceph status view granularity issue here?
>
> Regards,
> Kostis
>
> On 8 July 2014 20:01, Gregory Farnum <greg at inktank.com> wrote:
>> The impact won't be 300 times bigger, but it will be bigger. There are two
>> things impacting your cluster here:
>> 1) the initial "split" of the affected PGs into multiple child PGs. You can
>> mitigate this by stepping through pg_num at small multiples.
>> 2) the movement of data to its new location (when you adjust pgp_num). This
>> can be adjusted by setting the "OSD max backfills" and related parameters;
>> check the docs.
>> -Greg
>>
>>
>> On Tuesday, July 8, 2014, Kostis Fardelas <dante1234 at gmail.com> wrote:
>>>
>>> Hi,
>>> we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
>>> used space. We store data objects basically on two pools, one being
>>> appr. 300x larger than the other in terms of data stored and # of
>>> objects. Based on the formula provided here
>>> http://ceph.com/docs/master/rados/operations/placement-groups/ we
>>> computed that we need to increase our per pool pg_num & pgp_num to
>>> appr 6300 PGs / pool (100 * 126 / 2).
>>> We started by increasing the pg & pgp number on the smaller pool from
>>> 1800 to 2048 PGs (first the pg_num, then the pgp_num) and we
>>> experienced a 10X increase in Ceph total operations and an appr 3X
>>> disk latency increase in some underlying OSD disks. At the same time,
>>> for appr 10 seconds we experienced very low values of client io and
>>> op/s.
>>>
>>> Should we be worried that the pg/pgp num increase on the bigger pool
>>> will have a 300X larger impact?
>>> Can we throttle this impact by injecting any thresholds or applying an
>>> appropriate configuration on our ceph conf?
>>>
>>> Regards,
>>> Kostis
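

To make the "small steps" advice from the thread a bit more concrete,
here is a minimal Python sketch of a stepping plan for pg_num. It is
only an illustration and was not posted in the thread: the function
names pg_step_plan and pool_set_commands, the 5% cap per step, and the
pool name "mypool" are all assumptions. It only prints
"ceph osd pool set" lines for review; it does not run anything
against a cluster.

```python
import math


def pg_step_plan(current, target, max_fraction=0.05):
    """Return increasing pg_num values from current up to target.

    Each step grows pg_num by at most max_fraction of its value at that
    point (and by at least 1 PG, so the loop always terminates).
    The 5% default is an assumption, not a recommendation from the thread.
    """
    if current <= 0 or target < current:
        raise ValueError("need 0 < current <= target")
    if max_fraction <= 0:
        raise ValueError("max_fraction must be positive")
    steps = []
    while current < target:
        increment = max(1, math.floor(current * max_fraction))
        current = min(target, current + increment)
        steps.append(current)
    return steps


def pool_set_commands(pool, steps):
    """Render the plan as text: pg_num first, then pgp_num, for each step."""
    lines = []
    for n in steps:
        lines.append(f"ceph osd pool set {pool} pg_num {n}")
        lines.append(f"ceph osd pool set {pool} pgp_num {n}")
    return lines


if __name__ == "__main__":
    # 1800 -> 2048 is the change described in the thread;
    # "mypool" is a hypothetical pool name.
    plan = pg_step_plan(1800, 2048, max_fraction=0.05)
    for line in pool_set_commands("mypool", plan):
        print(line)
```

The sketch does not wait for the cluster to settle between steps; in
practice one would presumably check cluster status before applying the
next pair of commands.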