Hello,

On Sat, 30 Aug 2014 18:27:22 -0400 J David wrote:

> On Fri, Aug 29, 2014 at 2:53 AM, Christian Balzer <chibi at gol.com> wrote:
> >> Now, 1200 is not a power of two, but it makes sense. (12 x 100).
>
> Should have been 600 and then upped to 1024.
>
> At the time, there was a reason why doing that did not work, but I
> don't remember the specifics. All messages sent back in time telling
> then-us to try harder or take better notes have thus far been ignored.
>
> >> Probably we forewent the power of two because it was such a huge
> >> increase and we were already erring large.
> >>
> > Which unfortunately in my experience is what you have to do if you want
> > an even distribution with smallish clusters.
>
> In the end, this made no difference. By slipping one more OSD into
> the fray, I was able to bring the average utilization down enough to
> inch up to 2048 PGs. It had basically no effect on how evenly the
> OSDs are used. (Counting the new OSD, which is only 62% used, things
> have actually gotten worse.) Here are the current df's:
>
I wonder if there's something going on other than just uneven PG
distribution, but what that might be, aside from ridiculous FS overhead
or maybe the omap (../current/omap) leveldb going into megabloat, I
don't know (a quick du check is sketched at the bottom of this mail).
I see no more than 10% deviation here across 3 clusters.

> Node 1:
> /dev/sda2       358G  269G   89G  76%  /var/lib/ceph/osd/ceph-0
> /dev/sdb2       358G  310G   49G  87%  /var/lib/ceph/osd/ceph-1
> /dev/sdc2       358G  286G   73G  80%  /var/lib/ceph/osd/ceph-2
> /dev/sdd2       358G  287G   71G  81%  /var/lib/ceph/osd/ceph-3
>
> Node 2:
> /dev/sda2       358G  288G   70G  81%  /var/lib/ceph/osd/ceph-4
> /dev/sdd2       358G  311G   48G  87%  /var/lib/ceph/osd/ceph-9
> /dev/sdc2       358G  278G   81G  78%  /var/lib/ceph/osd/ceph-10
> /dev/sdb2       358G  296G   62G  83%  /var/lib/ceph/osd/ceph-11
>
> Node 3:
> /dev/sda2       358G  291G   67G  82%  /var/lib/ceph/osd/ceph-5
> /dev/sdb2       358G  296G   63G  83%  /var/lib/ceph/osd/ceph-6
> /dev/sdc2       358G  298G   61G  84%  /var/lib/ceph/osd/ceph-7
> /dev/sdd2       358G  282G   77G  79%  /var/lib/ceph/osd/ceph-8
>
> Node 4:
> /dev/sdb2       358G  219G  140G  62%  /var/lib/ceph/osd/ceph-12
>

I was going to ask you what version of Ceph you're running, but that got
answered by your other thread just now.

Firefly has improved CRUSH tunables by default, and thus a better
placement group distribution; however, changing those tunables is best
done on an idle cluster during the weekend (rough commands at the
bottom of this mail).

The one tunable mostly responsible for the better distribution seems to
be "chooseleaf_vary_r" (somebody from the Ceph team correct me if I'm
wrong); see the end of:
http://ceph.com/docs/master/rados/operations/crush-map/

That one is available in Emperor if you don't/can't go to Firefly.

Christian

> > If you're using RBD for VM images, you might be able to get space
> > back by doing an fstrim on those images from inside the VM.
>
> This isn't really about getting space back; we can buy more space if
> we need it. It's about not having things (like backfilling) fail
> because 1-2 OSDs are at 87% when the average use is <80%.
>
> So it seems like we're back to square one in terms of balancing out
> our OSDs. Is there a way to do it?
>
> Thanks!

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/
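
PS: A few command sketches to go with the above; treat them as rough
sketches rather than gospel.

First, to see whether PG distribution really is the culprit, count how
often each OSD id appears in the PG mappings. On a healthy cluster the
up and acting sets are identical, so every PG gets counted twice per
OSD here, which is fine for comparing relative numbers:

  # pull the bracketed up/acting sets out of the pg dump and count
  # how often each OSD id occurs in them
  ceph pg dump pgs_brief | grep -o '\[[0-9,]*\]' | tr -d '[]' \
    | tr ',' '\n' | sort -n | uniq -c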
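To check my omap megabloat theory, compare the size of the leveldb
against the overall filestore payload on each OSD node:

  # size of the omap leveldb vs. the whole object store, per OSD
  du -sh /var/lib/ceph/osd/ceph-*/current/omap
  du -sh /var/lib/ceph/osd/ceph-*/current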
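As for the tunables change itself: on Firefly you can just switch
profiles, on Emperor chooseleaf_vary_r has to be set by hand in a
decompiled CRUSH map. Either way expect a lot of data movement, hence
the idle weekend:

  # the easy way, on Firefly
  ceph osd crush tunables firefly

  # the manual way (works on Emperor as well)
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit crushmap.txt and add/set near the top:
  #   tunable chooseleaf_vary_r 1
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new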
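And for completeness, the fstrim from the quoted bit above. It only
does anything if discard is actually plumbed through to the RBD image
(e.g. virtio-scsi with discard=unmap), so don't expect miracles:

  # inside the VM, once per mounted filesystem
  fstrim -v /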