Hi Christian,

Ha ...

root@osd45:~# ceph osd pool get rbd pg_num
pg_num: 128
root@osd45:~# ceph osd pool get rbd pgp_num
pgp_num: 64

That's the explanation! I did run the command, but it spit out some (what I
thought was a harmless) warning; I should have checked more carefully.

I now have the expected data movement.

Thanks a lot!
JR
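For anyone hitting the same imbalance, a minimal sketch of the fix described
above, using only standard ceph CLI calls; the rbd pool and the target of 128
are just the values from this thread, and the point is that data only starts
moving once pgp_num has been raised to match pg_num:

  # confirm the two values actually match
  ceph osd pool get rbd pg_num
  ceph osd pool get rbd pgp_num

  # raise pg_num first, then pgp_num, in small steps (see the 64-PG
  # increments Christian recommends in the reply quoted below), and
  # check the output of each command rather than assuming it succeeded
  ceph osd pool set rbd pg_num 128
  ceph osd pool set rbd pgp_num 128
  ceph health    # wait for the new PGs to settle before the next step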
On 9/8/2014 10:04 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
>
>> Hi Christian, all,
>>
>> Having researched this a bit more, it seemed that just doing
>>
>> ceph osd pool set rbd pg_num 128
>> ceph osd pool set rbd pgp_num 128
>>
>> might be the answer. Alas, it was not. After running the above, the
>> cluster just sat there.
>>
> Really now? No data movement, no health warnings during that in the
> logs, no other error in the logs or when issuing that command?
> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
>
> You really want to get this addressed as per the previous reply before
> doing anything further. Because with just 64 PGs (as in only 8 per
> OSD!) massive imbalances are a given.
>
>> Finally, reading some more, I ran:
>>
>> ceph osd reweight-by-utilization
>>
> Reading can be dangerous. ^o^
>
> I didn't mention this, as it never worked for me in any predictable way
> and with a desirable outcome, especially in situations like yours.
>
>> This accomplished moving the utilization of the first drive on the
>> affected node to the second drive, e.g.:
>>
>> -------
>> BEFORE RUNNING:
>> -------
>> Filesystem   Use%
>> /dev/sdc1    57%
>> /dev/sdb1    65%
>> Filesystem   Use%
>> /dev/sdc1    90%
>> /dev/sdb1    75%
>> Filesystem   Use%
>> /dev/sdb1    52%
>> /dev/sdc1    52%
>> Filesystem   Use%
>> /dev/sdc1    54%
>> /dev/sdb1    63%
>>
>> -------
>> AFTER RUNNING:
>> -------
>> Filesystem   Use%
>> /dev/sdc1    57%
>> /dev/sdb1    65%
>> Filesystem   Use%
>> /dev/sdc1    70%   ** these two swapped (roughly) **
>> /dev/sdb1    92%   ** ^^^^^ ^^^ ^^^^^^^           **
>> Filesystem   Use%
>> /dev/sdb1    52%
>> /dev/sdc1    52%
>> Filesystem   Use%
>> /dev/sdc1    54%
>> /dev/sdb1    63%
>>
>> root@osd45:~# ceph osd tree
>> # id  weight  type name       up/down reweight
>> -1    3.44    root default
>> -2    0.86            host osd45
>> 0     0.43                    osd.0   up      1
>> 4     0.43                    osd.4   up      1
>> -3    0.86            host osd42
>> 1     0.43                    osd.1   up      1
>> 5     0.43                    osd.5   up      1
>> -4    0.86            host osd44
>> 2     0.43                    osd.2   up      1
>> 6     0.43                    osd.6   up      1
>> -5    0.86            host osd43
>> 3     0.43                    osd.3   up      1
>> 7     0.43                    osd.7   up      0.7007
>>
>> So this isn't the answer either.
>>
> It might have been, if it had more PGs to distribute things along, see
> above. But even then, with the default dumpling tunables, it might not
> be much better.
>
>> Could someone please chime in with an explanation/suggestion?
>>
>> I suspect that it might make sense to use 'ceph osd reweight osd.7 1'
>> and then run some form of 'ceph osd crush ...'?
>>
> No need to crush anything; reweight it to 1 after adding PGs/PGPs, and
> after all that data movement has finished, slowly dial down any still
> overly utilized OSD.
>
> Also, per the "Uneven OSD usage" thread, you might run into a "full"
> situation during data re-distribution. Increase PGs in small (64)
> increments.
>
>> Of course, I've read a number of things which suggest that the two
>> things I've done should have fixed my problem.
>>
>> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
>> issue and, were I running on firefly, it would be sufficient?
>>
> Running Firefly with all the tunables and probably hashpspool.
> Most of the tunables, with the exception of "chooseleaf_vary_r", are
> available on dumpling; hashpspool isn't, AFAIK.
> See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
> Christian
>>
>> Thanks much
>> JR
>> On 9/8/2014 1:50 PM, JR wrote:
>>> Hi Christian,
>>>
>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>
>>> This seems borne out by:
>>>
>>> root@osd45:~# rados lspools
>>> data
>>> metadata
>>> rbd
>>> volumes
>>> images
>>> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
>>> get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
>>> data pg(pg_num: 64, pgppg_num: 64
>>> metadata pg(pg_num: 64, pgppg_num: 64
>>> rbd pg(pg_num: 64, pgppg_num: 64
>>> volumes pg(pg_num: 128, pgppg_num: 128
>>> images pg(pg_num: 128, pgppg_num: 128
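As an aside, the loop above asks for pg_num in both command substitutions,
which is why no separate pgp_num value ever shows up in its output. A
corrected sketch of the same check (pool names come from rados lspools,
nothing else assumed):

  for i in $(rados lspools); do
      echo "$i: $(ceph osd pool get $i pg_num), $(ceph osd pool get $i pgp_num)"
  done

Run against the rbd pool after the pg_num change at the top of this thread,
that would have printed "pg_num: 128, pgp_num: 64" and exposed the mismatch
immediately.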
>>>
>>> According to the formula discussed in 'Uneven OSD usage':
>>>
>>> "The formula is actually OSDs * 100 / replication"
>>>
>>> In my case:
>>>
>>> 8 * 100 / 2 = 400
>>>
>>> So I'm erring on the large side?
>>>
>>> Or does this formula apply on a per-pool basis? Of my 5 pools I'm
>>> using 3:
>>>
>>> root@osd45:~# rados df | cut -c1-45
>>> pool name       category        KB
>>> data            -               0
>>> images          -               0
>>> metadata        -               10
>>> rbd             -               568489533
>>> volumes         -               594078601
>>> total used      2326235048      285923
>>> total avail     1380814968
>>> total space     3707050016
>>>
>>> So should I up the number of PGs for the rbd and volumes pools?
>>>
>>> I'll continue looking at docs, but for now I'll send this off.
>>>
>>> Thanks very much, Christian.
>>>
>>> ps. This cluster is self-contained and all nodes in it are completely
>>> loaded (i.e., I can't add any more nodes or disks). It's also not an
>>> option at the moment to upgrade to firefly (I can't make a big change
>>> before sending it out the door).
>>>
>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>>>
>>>>> Greetings all,
>>>>>
>>>>> I have a small ceph cluster (4 nodes, 2 OSDs per node) which
>>>>> recently started showing:
>>>>>
>>>>> root@osd45:~# ceph health
>>>>> HEALTH_WARN 1 near full osd(s)
>>>>>
>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h | egrep
>>>>> 'Filesystem|osd/ceph'; done
>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>> /dev/sdc1   442G  249G  194G   57%   /var/lib/ceph/osd/ceph-5
>>>>> /dev/sdb1   442G  287G  156G   65%   /var/lib/ceph/osd/ceph-1
>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>> /dev/sdc1   442G  396G  47G    90%   /var/lib/ceph/osd/ceph-7
>>>>> /dev/sdb1   442G  316G  127G   72%   /var/lib/ceph/osd/ceph-3
>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>> /dev/sdb1   442G  229G  214G   52%   /var/lib/ceph/osd/ceph-2
>>>>> /dev/sdc1   442G  229G  214G   52%   /var/lib/ceph/osd/ceph-6
>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>> /dev/sdc1   442G  238G  205G   54%   /var/lib/ceph/osd/ceph-4
>>>>> /dev/sdb1   442G  278G  165G   63%   /var/lib/ceph/osd/ceph-0
>>>>>
>>>> See the very recent "Uneven OSD usage" thread for a discussion about
>>>> this. What are your PG/PGP values?
>>>>
>>>>> This cluster has been running for weeks, under significant load, and
>>>>> has been 100% stable. Unfortunately, we have to ship it out of the
>>>>> building to another part of our business (where we will have little
>>>>> access to it).
>>>>>
>>>>> Based on what I've read about 'ceph osd reweight', I'm a bit
>>>>> hesitant to just run it (I don't want to do anything that impacts
>>>>> this cluster's stability).
>>>>>
>>>>> Is there another, better way to equalize the distribution of the
>>>>> data on the osd partitions?
>>>>>
>>>>> I'm running dumpling.
>>>>>
>>>> As per that thread and my experience, Firefly would solve this. If
>>>> you can upgrade during a weekend or whenever there is little to no
>>>> access, do it.
>>>>
>>>> Another option (of course, any and all of these will result in data
>>>> movement, so pick an appropriate time) would be to use "ceph osd
>>>> reweight" to lower the weight of osd.7 in particular.
>>>>
>>>> Lastly, given the utilization of your cluster, you really ought to
>>>> deploy more OSDs and/or more nodes; if a node went down, you'd
>>>> easily get into a "real" near full or full situation.
>>>>
>>>> Regards,
>>>>
>>>> Christian
>>>>
>>>
>>
>
>
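As a closing note on the manual approach suggested above: 'ceph osd reweight'
takes the OSD id and an override factor between 0 and 1 (the "reweight"
column in the ceph osd tree output earlier in the thread, distinct from the
CRUSH weight), so it can be dialed down and later restored. A minimal sketch;
osd.7 and the 0.85 factor are only illustrative:

  ceph osd reweight 7 0.85   # temporarily push data off the ~90% full OSD
  # ... after the pg_num/pgp_num increase has finished rebalancing ...
  ceph osd reweight 7 1      # restore the full override weight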