On Thu, Aug 28, 2014 at 10:47 PM, Christian Balzer <chibi at gol.com> wrote:

>> There are 1328 PG's in the pool, so about 110 per OSD.
>>
> And just to be pedantic, the PGP_NUM is the same?

Ah, "ceph status" reports 1328 pgs. But:

$ sudo ceph osd pool get rbd pg_num
pg_num: 1200
$ sudo ceph osd pool get rbd pgp_num
pgp_num: 1200

Now, 1200 is not a power of two, but it makes sense (12 x 100). We probably forwent the power of two because it was already such a huge increase and we were erring large. Apparently the 1328 figure includes 128 PGs for the (unused, in our case) data and metadata pools.

> Since you can't go down, the only way is up. To 2048
> See it as an early preparation step towards the time when you reach 48
> OSDs. ^o^

Demand for this cluster exceeds all estimates and plans, so that may happen (much) sooner than expected!

To start with, I bumped the 1200 PGs to 1280, figuring that it was at least power-of-twoier (tm) than 1200 and that I could then add 256 at a time.

However, the increase to 1280 caused several OSDs to spike above 85% utilization and wedged a bunch of PGs in active+remapped+backfill_toofull. To fix it, I had to set "osd backfill full ratio = 0.90" in ceph.conf and manually restart all the OSDs. That was pretty unsettling on a production cluster, so I'm definitely hesitant to raise pg_num any further if there's any chance the increase could push individual OSDs over 90%.

It's just so frustrating to have one OSD at 74% and another at 88% and be getting "near full" warnings as a result. The data could just shift over a little and everything would be fine. Feels like Happy Gilmore: "Why don't you just go home? That's your home!! Are you too good for your home?!?"

Thanks!
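
P.S. For anyone who runs into the same thing: the bump itself was done with the standard pool-set commands, something like

$ sudo ceph osd pool set rbd pg_num 1280
$ sudo ceph osd pool set rbd pgp_num 1280

and the backfill workaround was just that one line added to ceph.conf (under [osd] should do it) on each node before restarting the OSDs:

[osd]
    osd backfill full ratio = 0.90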