Hello,

On Sat, 30 Aug 2014 18:27:22 -0400 J David wrote:

> On Fri, Aug 29, 2014 at 2:53 AM, Christian Balzer <chibi at gol.com> wrote:
> >> Now, 1200 is not a power of two, but it makes sense. (12 x 100).
>
> Should have been 600 and then upped to 1024.
>
> At the time, there was a reason why doing that did not work, but I
> don't remember the specifics. All messages sent back in time telling
> then-us to try harder or take better notes have thus far been ignored.
>
> >> Probably we forewent the power of two because it was such a huge
> >> increase and we were already erring large.
> >>
> > Which unfortunately in my experience is what you have to do if you want
> > an even distribution with smallish clusters.
>
> In the end, this made no difference. By slipping one more OSD into
> the fray, I was able to bring the average utilization down enough to
> inch up to 2048 PGs. It had basically no effect on how evenly the
> OSDs are used. (Counting the new OSD, which is only 62% used, things
> have actually gotten worse.) Here are the current df's:
>
I wonder if there's something going on other than just uneven PG
distribution, but what that might be, aside from ridiculous FS overhead
or maybe the omap (../current/omap) leveldb going into megabloat, I
don't know (a quick du check is sketched at the bottom of this mail).
I see no more than 10% deviation here across 3 clusters.

> Node 1:
> /dev/sda2       358G  269G   89G  76%  /var/lib/ceph/osd/ceph-0
> /dev/sdb2       358G  310G   49G  87%  /var/lib/ceph/osd/ceph-1
> /dev/sdc2       358G  286G   73G  80%  /var/lib/ceph/osd/ceph-2
> /dev/sdd2       358G  287G   71G  81%  /var/lib/ceph/osd/ceph-3
>
> Node 2:
> /dev/sda2       358G  288G   70G  81%  /var/lib/ceph/osd/ceph-4
> /dev/sdd2       358G  311G   48G  87%  /var/lib/ceph/osd/ceph-9
> /dev/sdc2       358G  278G   81G  78%  /var/lib/ceph/osd/ceph-10
> /dev/sdb2       358G  296G   62G  83%  /var/lib/ceph/osd/ceph-11
>
> Node 3:
> /dev/sda2       358G  291G   67G  82%  /var/lib/ceph/osd/ceph-5
> /dev/sdb2       358G  296G   63G  83%  /var/lib/ceph/osd/ceph-6
> /dev/sdc2       358G  298G   61G  84%  /var/lib/ceph/osd/ceph-7
> /dev/sdd2       358G  282G   77G  79%  /var/lib/ceph/osd/ceph-8
>
> Node 4:
> /dev/sdb2       358G  219G  140G  62%  /var/lib/ceph/osd/ceph-12
>

I was going to ask you what version of Ceph you're running, but that got
answered by your other thread just now.

Firefly has improved CRUSH tunables by default, and thus a better
placement group distribution; however, changing those tunables is best
done on an idle cluster during the weekend (rough commands at the
bottom of this mail).

The one tunable mostly responsible for the better distribution seems to
be "chooseleaf_vary_r" (somebody from the Ceph team correct me if I'm
wrong); see the end of:
http://ceph.com/docs/master/rados/operations/crush-map/

That one is available in Emperor if you don't/can't go to Firefly.

Christian

> > If you're using RBD for VM images, you might be able to get space
> > back by doing an fstrim on those images from inside the VM.
>
> This isn't really about getting space back; we can buy more space if
> we need it. It's about not having things (like backfilling) fail
> because 1-2 OSDs are at 87% when the average use is <80%.
>
> So it seems like we're back to square one in terms of balancing out
> our OSDs. Is there a way to do it?
>
> Thanks!

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/
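
PS: A few command sketches to go with the above; treat them as rough
sketches rather than gospel.

First, to see whether PG distribution really is the culprit, count how
often each OSD id appears in the PG mappings. On a healthy cluster the
up and acting sets are identical, so every PG gets counted twice per
OSD here, which is fine for comparing relative numbers:

  # pull the bracketed up/acting sets out of the pg dump and count
  # how often each OSD id occurs in them
  ceph pg dump pgs_brief | grep -o '\[[0-9,]*\]' | tr -d '[]' \
    | tr ',' '\n' | sort -n | uniq -c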
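To check my omap megabloat theory, compare the size of the leveldb
against the overall filestore payload on each OSD node:

  # size of the omap leveldb vs. the whole object store, per OSD
  du -sh /var/lib/ceph/osd/ceph-*/current/omap
  du -sh /var/lib/ceph/osd/ceph-*/current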
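As for the tunables change itself: on Firefly you can just switch
profiles, on Emperor chooseleaf_vary_r has to be set by hand in a
decompiled CRUSH map. Either way expect a lot of data movement, hence
the idle weekend:

  # the easy way, on Firefly
  ceph osd crush tunables firefly

  # the manual way (works on Emperor as well)
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit crushmap.txt and add/set near the top:
  #   tunable chooseleaf_vary_r 1
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new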
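And for completeness, the fstrim from the quoted bit above. It only
does anything if discard is actually plumbed through to the RBD image
(e.g. virtio-scsi with discard=unmap), so don't expect miracles:

  # inside the VM, once per mounted filesystem
  fstrim -v /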