Hello,

On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:

> Hi Christian, all,
>
> Having researched this a bit more, it seemed that just doing
>
> ceph osd pool set rbd pg_num 128
> ceph osd pool set rbd pgp_num 128
>
> might be the answer. Alas, it was not. After running the above the
> cluster just sat there.
>
Really now? No data movement, no health warnings in the logs while that
was going on, no other errors in the logs or when issuing those commands?

Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?

You really want to get this addressed as per the previous reply before
doing anything further, because with just 64 PGs (as in only 8 per OSD!)
massive imbalances are a given.

> Finally, reading some more, I ran:
>
> ceph osd reweight-by-utilization
>
Reading can be dangerous. ^o^

I didn't mention this, as it never worked for me in any predictable way
or with a desirable outcome, especially in situations like yours.

> This accomplished moving the utilization of the first drive on the
> affected node to the 2nd drive! E.g.:
>
> -------
> BEFORE RUNNING:
> -------
> Filesystem   Use%
> /dev/sdc1     57%
> /dev/sdb1     65%
> Filesystem   Use%
> /dev/sdc1     90%
> /dev/sdb1     75%
> Filesystem   Use%
> /dev/sdb1     52%
> /dev/sdc1     52%
> Filesystem   Use%
> /dev/sdc1     54%
> /dev/sdb1     63%
>
> -------
> AFTER RUNNING:
> -------
> Filesystem   Use%
> /dev/sdc1     57%
> /dev/sdb1     65%
> Filesystem   Use%
> /dev/sdc1     70%   ** these two swapped (roughly) **
> /dev/sdb1     92%   ** ^^^^^ ^^^ ^^^^^^^           **
> Filesystem   Use%
> /dev/sdb1     52%
> /dev/sdc1     52%
> Filesystem   Use%
> /dev/sdc1     54%
> /dev/sdb1     63%
>
> root@osd45:~# ceph osd tree
> # id  weight  type name       up/down reweight
> -1    3.44    root default
> -2    0.86            host osd45
> 0     0.43                    osd.0   up      1
> 4     0.43                    osd.4   up      1
> -3    0.86            host osd42
> 1     0.43                    osd.1   up      1
> 5     0.43                    osd.5   up      1
> -4    0.86            host osd44
> 2     0.43                    osd.2   up      1
> 6     0.43                    osd.6   up      1
> -5    0.86            host osd43
> 3     0.43                    osd.3   up      1
> 7     0.43                    osd.7   up      0.7007
>
> So this isn't the answer either.
>
It might have been, if it had had more PGs to distribute things along, see
above. But even then, with the default dumpling tunables, it might not be
much better.

> Could someone please chime in with an explanation/suggestion?
>
> I suspect that it might make sense to use 'ceph osd reweight osd.7 1' and
> then run some form of 'ceph osd crush ...'?
>
No need to crush anything. Reweight it to 1 after adding PGs/PGPs, and
once all that data movement has finished, slowly dial down any still
overly utilized OSD.

Also, as per the "Uneven OSD usage" thread, you might run into a "full"
situation during data re-distribution, so increase PGs in small (64)
increments.

> Of course, I've read a number of things which suggest that the two
> things I've done should have fixed my problem.
>
> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
> issue and, were I running on firefly, it would be sufficient?
>
Running Firefly with all the tunables and probably hashpspool, yes. Most
of the tunables, with the exception of "chooseleaf_vary_r", are available
on dumpling; hashpspool isn't, AFAIK.

See http://ceph.com/docs/master/rados/operations/crush-map/#tunables

Christian

>
> Thanks much
> JR
> On 9/8/2014 1:50 PM, JR wrote:
> > Hi Christian,
> >
> > I have 448 PGs and 448 PGPs (according to ceph -s).
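(As a quick, untested aside: an easy way to double-check where those 448
come from, assuming the stock "pg_num: N" / "pgp_num: N" output of
"ceph osd pool get", would be something like

for p in $(rados lspools); do echo -n "$p: "; echo -n "$(ceph osd pool get $p pg_num), "; ceph osd pool get $p pgp_num; done

which should print both values for every pool, one pool per line.)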
> >
> > This seems borne out by:
> >
> > root@osd45:~# rados lspools
> > data
> > metadata
> > rbd
> > volumes
> > images
> > root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
> > get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
> > data pg(pg_num: 64, pgppg_num: 64
> > metadata pg(pg_num: 64, pgppg_num: 64
> > rbd pg(pg_num: 64, pgppg_num: 64
> > volumes pg(pg_num: 128, pgppg_num: 128
> > images pg(pg_num: 128, pgppg_num: 128
> >
> > According to the formula discussed in 'Uneven OSD usage,'
> >
> > "The formula is actually OSDs * 100 / replication"
> >
> > in my case:
> >
> > 8*100/2=400
> >
> > So I'm erring on the large side?
> >
> > Or does this formula apply on a per-pool basis? Of my 5 pools I'm using
> > 3:
> >
> > root@osd45:~# rados df|cut -c1-45
> > pool name       category        KB
> > data            -               0
> > images          -               0
> > metadata        -               10
> > rbd             -               568489533
> > volumes         -               594078601
> > total used      2326235048      285923
> > total avail     1380814968
> > total space     3707050016
> >
> > So should I up the number of PGs for the rbd and volumes pools?
> >
> > I'll continue looking at docs, but for now I'll send this off.
> >
> > Thanks very much, Christian.
> >
> > ps. This cluster is self-contained and all nodes in it are completely
> > loaded (i.e., I can't add any more nodes nor disks). It's also not an
> > option at the moment to upgrade to firefly (can't make a big change
> > before sending it out the door).
> >
> >
> >
> > On 9/8/2014 12:09 PM, Christian Balzer wrote:
> >>
> >> Hello,
> >>
> >> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
> >>
> >>> Greetings all,
> >>>
> >>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
> >>> started showing:
> >>>
> >>> root@osd45:~# ceph health
> >>> HEALTH_WARN 1 near full osd(s)
> >>>
> >>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
> >>> 'Filesystem|osd/ceph'; done
> >>> Filesystem   Size  Used Avail Use% Mounted on
> >>> /dev/sdc1    442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
> >>> /dev/sdb1    442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
> >>> Filesystem   Size  Used Avail Use% Mounted on
> >>> /dev/sdc1    442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
> >>> /dev/sdb1    442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
> >>> Filesystem   Size  Used Avail Use% Mounted on
> >>> /dev/sdb1    442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
> >>> /dev/sdc1    442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
> >>> Filesystem   Size  Used Avail Use% Mounted on
> >>> /dev/sdc1    442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
> >>> /dev/sdb1    442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
> >>>
> >>>
> >> See the very recent "Uneven OSD usage" thread for a discussion about
> >> this. What are your PG/PGP values?
> >>
> >>> This cluster has been running for weeks, under significant load, and
> >>> has been 100% stable. Unfortunately we have to ship it out of the
> >>> building to another part of our business (where we will have little
> >>> access to it).
> >>>
> >>> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant
> >>> to just run it (I don't want to do anything that impacts this
> >>> cluster's stability).
> >>>
> >>> Is there another, better way to equalize the distribution of the data
> >>> on the osd partitions?
> >>>
> >>> I'm running dumpling.
> >>>
> >> As per the thread and my experience, Firefly would solve this. If you
> >> can upgrade during a weekend or whenever there is little to no
> >> access, do it.
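(Should a window for that upgrade ever open up at its new home: the rough
idea after moving to Firefly, untested here and with the usual caveats,
would be something along the lines of

ceph osd crush tunables firefly
ceph osd pool set rbd hashpspool 1

i.e. switch to the Firefly CRUSH tunables profile and set the hashpspool
flag on the busy pools. Both of those are expected to shuffle data around,
so pick your timing the same way as for the PG increases.)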
> >>
> >> Another option (of course any and all of these will result in data
> >> movement, so pick an appropriate time) would be to use "ceph osd
> >> reweight" to lower the weight of osd.7 in particular.
> >>
> >> Lastly, given the utilization of your cluster, you really ought to
> >> deploy more OSDs and/or more nodes; if a node were to go down, you'd
> >> easily get into a "real" near full or full situation.
> >>
> >> Regards,
> >>
> >> Christian
> >>
> >
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/