Greetings,

Just a follow-up on the resolution of this issue: restarting ceph-osd on one
of the nodes solved the problem of the stuck unclean PGs.

Thanks,
JR

On 9/9/2014 2:24 AM, Christian Balzer wrote:
>
> Hello,
>
> On Tue, 09 Sep 2014 01:25:17 -0400 JR wrote:
>
>> Greetings
>>
>> After running for a couple of hours, my attempt to re-balance a
>> near full disk has stopped with a stuck unclean error:
>>
> Which is exactly what I warned you about below and what you should
> have also taken away from fully reading the "Uneven OSD usage" thread.
>
> This also should hammer home my previous point about your current
> cluster size/utilization. Even with a better (don't expect perfect)
> data distribution, loss of one node might well find you with a full
> OSD again.
>
>> root@osd45:~# ceph -s
>>   cluster c8122868-27af-11e4-b570-52540004010f
>>    health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean;
>>           recovery 13086/1158268 degraded (1.130%)
>>    monmap e1: 3 mons at {osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
>>           election epoch 80, quorum 0,1,2 osd42,osd43,osd45
>>    osdmap e723: 8 osds: 8 up, 8 in
>>    pgmap v543113: 640 pgs: 634 active+clean, 6 active+remapped+backfilling;
>>          2222 GB data, 2239 GB used, 1295 GB / 3535 GB avail;
>>          8268B/s wr, 0op/s; 13086/1158268 degraded (1.130%)
>>    mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby
>>
> From what I've read in the past, the way forward here is to increase
> the full ratio setting so it can finish the recovery. Or add more
> OSDs, at least temporarily. See:
> http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
>
> Read that and apply that knowledge to your cluster; I personally
> wouldn't deploy it in this state.
>
> Once the recovery is finished I'd proceed cautiously, see below.
>
>>
>> The sequence of events today that led to this were:
>>
>> # starting state: pg_num/pgp_num == 64
>> ceph osd pool set rbd pg_num 128
>> ceph osd pool set rbd pgp_num 128
>> # there was a warning thrown up (which I've lost) and which left pgp_num == 64
>> # nothing happens since pgp_num was inadvertently not raised
>> ceph osd reweight-by-utilization
>> # data moves from one osd on a host to another osd on the same host
>> ceph osd reweight 7 1
>> # data moves back to roughly what it had been
>>
> Never mind the lack of PGs to play with, manually lowering the weight
> of the fullest OSD (in small steps) at this time might have given you
> at least a more level playing field.
>
>> ceph osd pool set volumes pg_num 192
>> ceph osd pool set volumes pgp_num 192
>> # data moves successfully
>>
> This would have been the time to check what actually happened and
> whether things improved or not (just adding PGs/PGPs might not be
> enough), and again to manually reweight overly full OSDs.
>
>> ceph osd pool set rbd pg_num 192
>> ceph osd pool set rbd pgp_num 192
>> # data stuck
>>
> Baby steps. As in, applying the rise to 128 PGPs first. But I guess
> you would have run into the full OSD either way w/o reweighting
> things between steps.
>
>> googling (nowadays known as research) reveals that these might be
>> helpful:
>>
>> - ceph osd crush tunables optimal
>>
> Yes, this might help. Not sure if that works with dumpling, but as I
> already mentioned, dumpling doesn't support "chooseleaf_vary_r". And
> hashpspool. And while the data movement caused by this probably will
> result in a better balanced cluster (again, with too few PGs it will
> still do poorly), in the process of getting there it might still run
> into a full OSD scenario.
>
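[A note for anyone who finds this thread later: if backfill jams up against
the full threshold, the ratios can be raised temporarily so recovery can
finish, per the storage-capacity doc Christian links above. On dumpling the
runtime knobs were, as far as I recall, the "ceph pg" variants; the values
below are illustrative only and should go back to the defaults (0.85/0.95)
once recovery completes:

    ceph pg set_nearfull_ratio 0.90   # lift the nearfull warning threshold a little
    ceph pg set_full_ratio 0.97       # give backfill headroom to finish
    ceph health detail                # watch the nearfull/full warnings and backfill
    # once everything is active+clean again, restore the defaults:
    ceph pg set_full_ratio 0.95
    ceph pg set_nearfull_ratio 0.85

Treat this as a stop-gap while recovery runs, not a fix for the underlying
imbalance.]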
>> - setting crush weights to 1
>>
> Dunno about that one; my crush weights were 1 when I deployed things
> manually for the first time, the size of the OSD for the 2nd manual
> deployment, and ceph-deploy also uses the OSD size in TB.
>
> Christian
>
>> I resist doing anything for now in the hopes that someone has
>> something coherent to say (Christian? ;-)
>>
>> Thanks
>> JR
>>
>> On 9/8/2014 10:37 PM, JR wrote:
>>> Hi Christian,
>>>
>>> Ha ...
>>>
>>> root@osd45:~# ceph osd pool get rbd pg_num
>>> pg_num: 128
>>> root@osd45:~# ceph osd pool get rbd pgp_num
>>> pgp_num: 64
>>>
>>> That's the explanation! I did run the command but it spit out some
>>> (what I thought was a harmless) warning; I should have checked more
>>> carefully.
>>>
>>> I now have the expected data movement.
>>>
>>> Thanks a lot!
>>> JR
>>>
>>> On 9/8/2014 10:04 PM, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
>>>>
>>>>> Hi Christian, all,
>>>>>
>>>>> Having researched this a bit more, it seemed that just doing
>>>>>
>>>>> ceph osd pool set rbd pg_num 128
>>>>> ceph osd pool set rbd pgp_num 128
>>>>>
>>>>> might be the answer. Alas, it was not. After running the above,
>>>>> the cluster just sat there.
>>>>>
>>>> Really now? No data movement, no health warnings during that in the
>>>> logs, no other error in the logs or when issuing that command? Is it
>>>> really at 128 now, verified with "ceph osd pool get rbd pg_num"?
>>>>
>>>> You really want to get this addressed as per the previous reply
>>>> before doing anything further. Because with just 64 PGs (as in only
>>>> 8 per OSD!) massive imbalances are a given.
>>>>
>>>>> Finally, reading some more, I ran:
>>>>>
>>>>> ceph osd reweight-by-utilization
>>>>>
>>>> Reading can be dangerous. ^o^
>>>>
>>>> I didn't mention this, as it never worked for me in any predictable
>>>> way and with a desirable outcome, especially in situations like
>>>> yours.
>>>>
>>>>> This accomplished moving the utilization of the first drive on the
>>>>> affected node to the 2nd drive, e.g.:
>>>>>
>>>>> ------- BEFORE RUNNING: -------
>>>>> Filesystem  Use%
>>>>> /dev/sdc1   57%
>>>>> /dev/sdb1   65%
>>>>> Filesystem  Use%
>>>>> /dev/sdc1   90%
>>>>> /dev/sdb1   75%
>>>>> Filesystem  Use%
>>>>> /dev/sdb1   52%
>>>>> /dev/sdc1   52%
>>>>> Filesystem  Use%
>>>>> /dev/sdc1   54%
>>>>> /dev/sdb1   63%
>>>>>
>>>>> ------- AFTER RUNNING: -------
>>>>> Filesystem  Use%
>>>>> /dev/sdc1   57%
>>>>> /dev/sdb1   65%
>>>>> Filesystem  Use%
>>>>> /dev/sdc1   70%  ** these two swapped (roughly) **
>>>>> /dev/sdb1   92%  ** ^^^^^ ^^^ ^^^^^^^          **
>>>>> Filesystem  Use%
>>>>> /dev/sdb1   52%
>>>>> /dev/sdc1   52%
>>>>> Filesystem  Use%
>>>>> /dev/sdc1   54%
>>>>> /dev/sdb1   63%
>>>>>
>>>>> root@osd45:~# ceph osd tree
>>>>> # id    weight  type name       up/down reweight
>>>>> -1      3.44    root default
>>>>> -2      0.86            host osd45
>>>>> 0       0.43                    osd.0   up      1
>>>>> 4       0.43                    osd.4   up      1
>>>>> -3      0.86            host osd42
>>>>> 1       0.43                    osd.1   up      1
>>>>> 5       0.43                    osd.5   up      1
>>>>> -4      0.86            host osd44
>>>>> 2       0.43                    osd.2   up      1
>>>>> 6       0.43                    osd.6   up      1
>>>>> -5      0.86            host osd43
>>>>> 3       0.43                    osd.3   up      1
>>>>> 7       0.43                    osd.7   up      0.7007
>>>>>
>>>>> So this isn't the answer either.
>>>>>
>>>> It might have been, if it had more PGs to distribute things along,
>>>> see above. But even then, with the default dumpling tunables it
>>>> might not be much better.
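[A quick sanity check that would have caught the silent pgp_num mismatch
above: print both values for every pool and make sure they agree. This is
plain bash against the stock CLI; as far as I know nothing in it is
version-specific:

    for pool in $(rados lspools); do
        echo "== $pool =="
        ceph osd pool get "$pool" pg_num
        ceph osd pool get "$pool" pgp_num
    done

If pg_num and pgp_num differ for a pool, the new PGs exist but the data has
not actually been re-placed yet, which is exactly the "cluster just sat
there" symptom above.]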
>>>>>
>>>>> Could someone please chime in with an explanation/suggestion?
>>>>>
>>>>> I suspect that it might make sense to use 'ceph osd reweight
>>>>> osd.7 1' and then run some form of 'ceph osd crush ...'?
>>>>>
>>>> No need to crush anything; reweight it to 1 after adding PGs/PGPs,
>>>> and after all that data movement has finished, slowly dial down any
>>>> still overly utilized OSD.
>>>>
>>>> Also, per the "Uneven OSD usage" thread, you might run into a "full"
>>>> situation during data re-distribution. Increase PGs in small (64)
>>>> increments.
>>>>
>>>>> Of course, I've read a number of things which suggest that the two
>>>>> things I've done should have fixed my problem.
>>>>>
>>>>> Is it (gasp!) possible that this, as Christian suggests, is a
>>>>> dumpling issue and, were I running on firefly, it would be
>>>>> sufficient?
>>>>>
>>>> Running Firefly with all the tunables and probably hashpspool. Most
>>>> of the tunables, with the exception of "chooseleaf_vary_r", are
>>>> available on dumpling; hashpspool isn't, AFAIK. See
>>>> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>>>>
>>>> Christian
>>>>
>>>>> Thanks much
>>>>> JR
>>>>>
>>>>> On 9/8/2014 1:50 PM, JR wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>>>>
>>>>>> This seems borne out by:
>>>>>>
>>>>>> root@osd45:~# rados lspools
>>>>>> data
>>>>>> metadata
>>>>>> rbd
>>>>>> volumes
>>>>>> images
>>>>>> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
>>>>>> data pg(pg_num: 64, pgppg_num: 64
>>>>>> metadata pg(pg_num: 64, pgppg_num: 64
>>>>>> rbd pg(pg_num: 64, pgppg_num: 64
>>>>>> volumes pg(pg_num: 128, pgppg_num: 128
>>>>>> images pg(pg_num: 128, pgppg_num: 128
>>>>>>
>>>>>> According to the formula discussed in 'Uneven OSD usage,'
>>>>>>
>>>>>> "The formula is actually OSDs * 100 / replication"
>>>>>>
>>>>>> in my case:
>>>>>>
>>>>>> 8 * 100 / 2 = 400
>>>>>>
>>>>>> So I'm erring on the large side?
>>>>>>
>>>>>> Or, does this formula apply on a per-pool basis? Of my 5 pools I'm
>>>>>> using 3:
>>>>>>
>>>>>> root@osd45:~# rados df | cut -c1-45
>>>>>> pool name       category                 KB
>>>>>> data            -                          0
>>>>>> images          -                          0
>>>>>> metadata        -                         10
>>>>>> rbd             -                  568489533
>>>>>> volumes         -                  594078601
>>>>>>   total used       2326235048        285923
>>>>>>   total avail      1380814968
>>>>>>   total space      3707050016
>>>>>>
>>>>>> So should I up the number of PGs for the rbd and volumes pools?
>>>>>>
>>>>>> I'll continue looking at the docs, but for now I'll send this off.
>>>>>>
>>>>>> Thanks very much, Christian.
>>>>>>
>>>>>> ps. This cluster is self-contained and all nodes in it are
>>>>>> completely loaded (i.e., I can't add any more nodes nor disks).
>>>>>> It's also not an option at the moment to upgrade to firefly (can't
>>>>>> make a big change before sending it out the door).
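[To make the reweight advice above concrete, the sequence would look roughly
like the following; osd.7 and the 0.95 step are only taken from the ceph osd
tree output earlier in the thread, not a recommendation:

    ceph osd reweight 7 1.0    # undo what reweight-by-utilization did
    # wait until "ceph -s" shows all PGs active+clean again, then
    ceph osd reweight 7 0.95   # dial the fullest OSD down in small steps
    ceph osd tree              # confirm the reweight column changed
    df -h /var/lib/ceph/osd/ceph-7   # watch its utilization come down

Repeat the small downward steps only while that OSD is still noticeably
fuller than its peers, and let each step's data movement finish before the
next one.]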
>>>>>>
>>>>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>>>>>>
>>>>>>>> Greetings all,
>>>>>>>>
>>>>>>>> I have a small ceph cluster (4 nodes, 2 osds per node) which
>>>>>>>> recently started showing:
>>>>>>>>
>>>>>>>> root@osd45:~# ceph health
>>>>>>>> HEALTH_WARN 1 near full osd(s)
>>>>>>>>
>>>>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h | egrep 'Filesystem|osd/ceph'; done
>>>>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>>>>> /dev/sdc1   442G  249G   194G   57%  /var/lib/ceph/osd/ceph-5
>>>>>>>> /dev/sdb1   442G  287G   156G   65%  /var/lib/ceph/osd/ceph-1
>>>>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>>>>> /dev/sdc1   442G  396G    47G   90%  /var/lib/ceph/osd/ceph-7
>>>>>>>> /dev/sdb1   442G  316G   127G   72%  /var/lib/ceph/osd/ceph-3
>>>>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>>>>> /dev/sdb1   442G  229G   214G   52%  /var/lib/ceph/osd/ceph-2
>>>>>>>> /dev/sdc1   442G  229G   214G   52%  /var/lib/ceph/osd/ceph-6
>>>>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>>>>> /dev/sdc1   442G  238G   205G   54%  /var/lib/ceph/osd/ceph-4
>>>>>>>> /dev/sdb1   442G  278G   165G   63%  /var/lib/ceph/osd/ceph-0
>>>>>>>>
>>>>>>> See the very recent "Uneven OSD usage" thread for a discussion
>>>>>>> about this. What are your PG/PGP values?
>>>>>>>
>>>>>>>> This cluster has been running for weeks, under significant load,
>>>>>>>> and has been 100% stable. Unfortunately we have to ship it out of
>>>>>>>> the building to another part of our business (where we will have
>>>>>>>> little access to it).
>>>>>>>>
>>>>>>>> Based on what I've read about 'ceph osd reweight' I'm a bit
>>>>>>>> hesitant to just run it (I don't want to do anything that impacts
>>>>>>>> this cluster's stability).
>>>>>>>>
>>>>>>>> Is there another, better way to equalize the distribution of the
>>>>>>>> data on the osd partitions?
>>>>>>>>
>>>>>>>> I'm running dumpling.
>>>>>>>>
>>>>>>> As per the thread and my experience, Firefly would solve this. If
>>>>>>> you can upgrade during a weekend or whenever there is little to no
>>>>>>> access, do it.
>>>>>>>
>>>>>>> Another option (of course any and all of these will result in data
>>>>>>> movement, so pick an appropriate time) would be to use "ceph osd
>>>>>>> reweight" to lower the weight of osd.7 in particular.
>>>>>>>
>>>>>>> Lastly, given the utilization of your cluster, you really ought to
>>>>>>> deploy more OSDs and/or more nodes; if a node were to go down you'd
>>>>>>> easily get into a "real" near full or full situation.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian

--
Your electronic communications are being monitored; strong encryption is an answer.
My public key <http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x4F08C504BD634953>