Uneven OSD usage

chibi@xxxxxxx (Christian Balzer) · Fri, 29 Aug 2014 15:53:16 +0900

Hello,

On Fri, 29 Aug 2014 02:32:39 -0400 J David wrote:

> On Thu, Aug 28, 2014 at 10:47 PM, Christian Balzer <chibi at gol.com> wrote:
> >> There are 1328 PG's in the pool, so about 110 per OSD.
> >>
> > And just to be pedantic, the PGP_NUM is the same?
> 
> Ah, "ceph status" reports 1328 pgs.  But:
> 
> $ sudo ceph osd pool get rbd pg_num
> pg_num: 1200
> $ sudo ceph osd pool get rbd pgp_num
> pgp_num: 1200
> 
> Now, 1200 is not a power of two, but it makes sense.  (12 x 100).
Should have been 600 and then upped to 1024.
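
For reference, the 600 -> 1024 figure follows the usual rule of thumb: roughly 100 PGs per OSD, divided by the replica count, then rounded up to a power of two. Below is a minimal sketch of that arithmetic. The function name is made up for illustration, and the replica count of 2 is an assumption inferred from the 600 figure (12 OSDs x 100 / 2); it is not stated in this thread.

```python
def suggested_pg_count(num_osds, replicas, target_per_osd=100):
    """Rule-of-thumb PG count: (OSDs * target) / replicas, rounded up to a power of two."""
    raw = num_osds * target_per_osd / replicas
    power = 1
    # Double until we reach or exceed the raw figure.
    while power < raw:
        power *= 2
    return power


# 12 OSDs (from the thread), replica count 2 (assumed):
# raw figure is 600, which rounds up to 1024.
print(suggested_pg_count(12, 2))
```

This is only the guideline value for a fresh pool; since pg_num cannot be decreased, it does not help once a pool is already above it.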

> Probably we forewent the power of two because it was such a huge
> increase and we were already erring large.
> 
Which unfortunately in my experience is what you have to do if you want
even distribution with smallish clusters.

> Apparently the 1328 figure includes 128 pg's for the (unused in our
> case) data and metadata pools.
> 
Indeed.

> > Since you can't go down, the only way is up. To 2048.
> > See it as an early preparation step towards the time when you reach 48
> > OSDs. ^o^
> 
> Demand for this cluster exceeds all estimates and plans, so that may
> be (much) sooner than expected!
> 
> To start with, I bumped the 1200 pg's to 1280, figuring that at least
> it was power-of-twoier (tm) than 1200, and that I could then add 256
> at a time.
> 
> However, the increase to 1280 caused several OSD's to spike up over
> 85% and wedged a bunch of pg's in active+remapped+backfill_toofull.
> To fix it, I had to change "osd backfill full ratio = 0.90" in the
> ceph.conf and manually restart all the OSD's.  That was pretty
> unsettling on a production cluster, so I'm definitely hesitant to
> raise it any more if there's any chance increasing it could push
> individual OSD's over 90%.
> 
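
To make the threshold effect described above concrete, here is a small sketch that checks per-OSD utilization against a near-full ratio and a backfill-full ratio. Everything in it is illustrative: the function and OSD names are hypothetical, the 0.85 values are assumed to be the thresholds that were in effect before the change to 0.90, and the simple ">=" comparison is a simplification rather than Ceph's actual logic. The 74% and 88% figures are the ones quoted further down in this mail.

```python
def classify_osds(utilization, nearfull_ratio=0.85, backfill_full_ratio=0.85):
    """Return, per OSD, which (illustrative) thresholds its utilization has crossed."""
    report = {}
    for osd, used in utilization.items():
        flags = []
        if used >= nearfull_ratio:
            flags.append("nearfull")
        if used >= backfill_full_ratio:
            # An OSD this full would refuse incoming backfill, leaving
            # PGs that target it stuck in backfill_toofull.
            flags.append("refuses_backfill")
        report[osd] = flags
    return report


# Hypothetical OSD names; utilization figures taken from the thread.
usage = {"osd.a": 0.74, "osd.b": 0.88}

print(classify_osds(usage))                            # assumed 0.85 thresholds
print(classify_osds(usage, backfill_full_ratio=0.90))  # after raising the ratio
```

With the ratio at 0.90 the 88% OSD accepts backfill again, but it is still over the near-full mark, so the warning stays.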
Yeah, it probably won't get better (as in more even) until you reach 2048.
Something else that comes to mind is the overhead in terms of disk space
used for the additional PGs (directories), but that shouldn't be a real
factor. 

If you're using RBD for VM images, you might be able to get space back
by doing an fstrim on those images from inside the VM.
For that to work, qemu needs to attach them as IDE or virtio-scsi, though.

Regards,

Christian
> It's just so frustrating to have one OSD at 74% and another at 88% and
> be taking "near full" warnings as a result.  The data could just move
> over a little and everything would be fine.  Feels like Happy Gilmore.
> "Why don't you just go home? That's your home!! Are you too good for
> your home?!?"
> 
> Thanks!
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/