Greetings,

After running for a couple of hours, my attempt to re-balance a near full
disk has stopped with a stuck unclean error:

root@osd45:~# ceph -s
  cluster c8122868-27af-11e4-b570-52540004010f
   health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery
          13086/1158268 degraded (1.130%)
   monmap e1: 3 mons at
          {osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
          election epoch 80, quorum 0,1,2 osd42,osd43,osd45
   osdmap e723: 8 osds: 8 up, 8 in
    pgmap v543113: 640 pgs: 634 active+clean, 6 active+remapped+backfilling;
          2222 GB data, 2239 GB used, 1295 GB / 3535 GB avail; 8268B/s wr,
          0op/s; 13086/1158268 degraded (1.130%)
   mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby

The sequence of events today that led to this was:

# starting state: pg_num/pgp_num == 64
ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128
# there was a warning thrown up (which I've lost) and which left pgp_num == 64
# nothing happens since pgp_num was inadvertently not raised

ceph osd reweight-by-utilization
# data moves from one osd on a host to another osd on the same host

ceph osd reweight 7 1
# data moves back to roughly what it had been

ceph osd pool set volumes pg_num 192
ceph osd pool set volumes pgp_num 192
# data moves successfully

ceph osd pool set rbd pg_num 192
ceph osd pool set rbd pgp_num 192
# data stuck

Googling (nowadays known as research) reveals that these might be helpful
(sketched more concretely below):

- ceph osd crush tunables optimal
- setting crush weights to 1
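For concreteness, here is roughly what I have in mind, together with the
read-only checks I've been using to watch the stuck PGs. Nothing below has
been run yet; the PG id in the query example is just a placeholder, and the
crush reweight line would have to be repeated per OSD with whatever weight
is actually appropriate:

# read-only sanity checks: confirm pg_num == pgp_num and see which
# PGs are stuck and where they are mapped
ceph osd pool get rbd pg_num
ceph osd pool get rbd pgp_num
ceph health detail
ceph pg dump_stuck unclean
ceph pg 2.3f query          # placeholder id; use one reported above

# option 1: switch to the newer CRUSH tunables (will move data around)
ceph osd crush tunables optimal

# option 2: set the CRUSH weights (currently 0.43 each, see the
# 'ceph osd tree' output further down) to 1, one OSD at a time, e.g.:
ceph osd crush reweight osd.7 1.0

Either of these obviously means another round of data movement.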
I resist doing anything for now in the hopes that someone has something
coherent to say (Christian? ;-)

Thanks,
JR

On 9/8/2014 10:37 PM, JR wrote:
> Hi Christian,
>
> Ha ...
>
> root@osd45:~# ceph osd pool get rbd pg_num
> pg_num: 128
> root@osd45:~# ceph osd pool get rbd pgp_num
> pgp_num: 64
>
> That's the explanation! I did run the command, but it spit out some
> (what I thought was a harmless) warning; I should have checked more
> carefully.
>
> I now have the expected data movement.
>
> Thanks a lot!
> JR
>
> On 9/8/2014 10:04 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
>>
>>> Hi Christian, all,
>>>
>>> Having researched this a bit more, it seemed that just doing
>>>
>>> ceph osd pool set rbd pg_num 128
>>> ceph osd pool set rbd pgp_num 128
>>>
>>> might be the answer. Alas, it was not. After running the above, the
>>> cluster just sat there.
>>>
>> Really now? No data movement, no health warnings during that in the
>> logs, no other error in the logs or when issuing that command?
>> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
>>
>> You really want to get this addressed as per the previous reply before
>> doing anything further, because with just 64 PGs (as in only 8 per OSD!)
>> massive imbalances are a given.
>>
>>> Finally, reading some more, I ran:
>>>
>>> ceph osd reweight-by-utilization
>>>
>> Reading can be dangerous. ^o^
>>
>> I didn't mention this, as it never worked for me in any predictable way
>> or with a desirable outcome, especially in situations like yours.
>>
>>> This accomplished moving the utilization of the first drive on the
>>> affected node to the 2nd drive! E.g.:
>>>
>>> -------
>>> BEFORE RUNNING:
>>> -------
>>> Filesystem  Use%
>>> /dev/sdc1    57%
>>> /dev/sdb1    65%
>>> Filesystem  Use%
>>> /dev/sdc1    90%
>>> /dev/sdb1    75%
>>> Filesystem  Use%
>>> /dev/sdb1    52%
>>> /dev/sdc1    52%
>>> Filesystem  Use%
>>> /dev/sdc1    54%
>>> /dev/sdb1    63%
>>>
>>> -------
>>> AFTER RUNNING:
>>> -------
>>> Filesystem  Use%
>>> /dev/sdc1    57%
>>> /dev/sdb1    65%
>>> Filesystem  Use%
>>> /dev/sdc1    70%   ** these two swapped (roughly) **
>>> /dev/sdb1    92%   ** ^^^^^ ^^^ ^^^^^^^           **
>>> Filesystem  Use%
>>> /dev/sdb1    52%
>>> /dev/sdc1    52%
>>> Filesystem  Use%
>>> /dev/sdc1    54%
>>> /dev/sdb1    63%
>>>
>>> root@osd45:~# ceph osd tree
>>> # id  weight  type name       up/down reweight
>>> -1    3.44    root default
>>> -2    0.86            host osd45
>>> 0     0.43                    osd.0   up      1
>>> 4     0.43                    osd.4   up      1
>>> -3    0.86            host osd42
>>> 1     0.43                    osd.1   up      1
>>> 5     0.43                    osd.5   up      1
>>> -4    0.86            host osd44
>>> 2     0.43                    osd.2   up      1
>>> 6     0.43                    osd.6   up      1
>>> -5    0.86            host osd43
>>> 3     0.43                    osd.3   up      1
>>> 7     0.43                    osd.7   up      0.7007
>>>
>>> So this isn't the answer either.
>>>
>> It might have been, if it had more PGs to distribute things along, see
>> above. But even then, with the default dumpling tunables, it might not
>> be much better.
>>
>>> Could someone please chime in with an explanation/suggestion?
>>>
>>> I suspect it might make sense to use 'ceph osd reweight osd.7 1' and
>>> then run some form of 'ceph osd crush ...'?
>>>
>> No need to crush anything; reweight it to 1 after adding PGs/PGPs and,
>> after all that data movement has finished, slowly dial down any still
>> overly utilized OSD.
>>
>> Also, per the "Uneven OSD usage" thread, you might run into a "full"
>> situation during data re-distribution. Increase PGs in small (64)
>> increments.
>>
>>> Of course, I've read a number of things which suggest that the two
>>> things I've done should have fixed my problem.
>>>
>>> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
>>> issue and, were I running on firefly, it would be sufficient?
>>>
>> Running Firefly with all the tunables and probably hashpspool.
>> Most of the tunables, with the exception of "chooseleaf_vary_r", are
>> available on dumpling; hashpspool isn't, AFAIK.
>> See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>>
>> Christian
>>>
>>> Thanks much
>>> JR
>>> On 9/8/2014 1:50 PM, JR wrote:
>>>> Hi Christian,
>>>>
>>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>>
>>>> This seems borne out by:
>>>>
>>>> root@osd45:~# rados lspools
>>>> data
>>>> metadata
>>>> rbd
>>>> volumes
>>>> images
>>>> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
>>>> get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
>>>> data pg(pg_num: 64, pgppg_num: 64
>>>> metadata pg(pg_num: 64, pgppg_num: 64
>>>> rbd pg(pg_num: 64, pgppg_num: 64
>>>> volumes pg(pg_num: 128, pgppg_num: 128
>>>> images pg(pg_num: 128, pgppg_num: 128
>>>>
>>>> According to the formula discussed in 'Uneven OSD usage':
>>>>
>>>> "The formula is actually OSDs * 100 / replication"
>>>>
>>>> In my case:
>>>>
>>>> 8 * 100 / 2 = 400
>>>>
>>>> So I'm erring on the large side?
>>>>
>>>> Or does this formula apply on a per-pool basis? Of my 5 pools I'm
>>>> using 3:
>>>>
>>>> root@osd45:~# rados df | cut -c1-45
>>>> pool name       category                 KB
>>>> data            -                          0
>>>> images          -                          0
>>>> metadata        -                         10
>>>> rbd             -                  568489533
>>>> volumes         -                  594078601
>>>>   total used      2326235048       285923
>>>>   total avail     1380814968
>>>>   total space     3707050016
>>>>
>>>> So should I up the number of PGs for the rbd and volumes pools?
>>>>
>>>> I'll continue looking at docs, but for now I'll send this off.
>>>>
>>>> Thanks very much, Christian.
>>>>
>>>> P.S. This cluster is self-contained and all nodes in it are completely
>>>> loaded (i.e., I can't add any more nodes or disks). It's also not an
>>>> option at the moment to upgrade to firefly (I can't make a big change
>>>> before sending it out the door).
>>>>
>>>>
>>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>>>>
>>>>>> Greetings all,
>>>>>>
>>>>>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
>>>>>> started showing:
>>>>>>
>>>>>> root@osd45:~# ceph health
>>>>>> HEALTH_WARN 1 near full osd(s)
>>>>>>
>>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h | egrep
>>>>>> 'Filesystem|osd/ceph'; done
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdc1       442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>>>>>> /dev/sdb1       442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdc1       442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>>>>>> /dev/sdb1       442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdb1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>>>>>> /dev/sdc1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdc1       442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>>>>>> /dev/sdb1       442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>>>>>
>>>>>>
>>>>> See the very recent "Uneven OSD usage" thread for a discussion about
>>>>> this. What are your PG/PGP values?
>>>>>
>>>>>> This cluster has been running for weeks, under significant load, and
>>>>>> has been 100% stable. Unfortunately we have to ship it out of the
>>>>>> building to another part of our business (where we will have little
>>>>>> access to it).
>>>>>>
>>>>>> Based on what I've read about 'ceph osd reweight', I'm a bit hesitant
>>>>>> to just run it (I don't want to do anything that impacts this
>>>>>> cluster's stability).
>>>>>>
>>>>>> Is there another, better way to equalize the distribution of the data
>>>>>> on the osd partitions?
>>>>>>
>>>>>> I'm running dumpling.
>>>>>>
>>>>> As per that thread and my experience, Firefly would solve this. If you
>>>>> can upgrade during a weekend or whenever there is little to no
>>>>> access, do it.
>>>>>
>>>>> Another option (of course any and all of these will result in data
>>>>> movement, so pick an appropriate time) would be to use "ceph osd
>>>>> reweight" to lower the weight of osd.7 in particular.
>>>>>
>>>>> Lastly, given the utilization of your cluster, you really ought to
>>>>> deploy more OSDs and/or more nodes; if a node went down, you'd
>>>>> easily get into a "real" near full or full situation.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Christian
>>>>>
>>>>
>>>
>>
>>
>

--
Your electronic communications are being monitored; strong encryption is
an answer. My public key:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x4F08C504BD634953