Is ceph osd reweight always safe to use?

botemout@xxxxxxxxx (JR) · Thu, 11 Sep 2014 13:03:42 -0400

Greetings

Just a follow up on the resolution of this issue.

Restarting ceph-osd on one of the nodes solved the problem of the
stuck unclean pgs.

Thanks,
JR

On 9/9/2014 2:24 AM, Christian Balzer wrote:
> 
> Hello,
> 
> On Tue, 09 Sep 2014 01:25:17 -0400 JR wrote:
> 
>> Greetings
>> 
>> After running for a couple of hours, my attempt to re-balance a 
>> near-full disk has stopped with a stuck unclean error:
>> 
> Which is exactly what I warned you about below and what you should 
> have also taken away from fully reading the "Uneven OSD usage" 
> thread.
> 
> This also should hammer my previous point about your current 
> cluster size/utilization home. Even with a better (don't expect 
> perfect) data distribution, loss of one node might well find you 
> with a full OSD again.
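
The node-loss concern can be checked with rough arithmetic. The sketch below uses the totals from the `ceph -s` output quoted in this thread (2239 GB used of 3535 GB, 8 OSDs, 2 per node). It assumes equally sized OSDs and Ceph's default near-full and full ratios of 0.85 and 0.95; neither assumption is stated in the thread, and the function name is illustrative only.

```python
# Figures taken from the `ceph -s` output quoted in this thread.
TOTAL_GB = 3535      # raw capacity across all OSDs
USED_GB = 2239       # raw space currently used
NUM_OSDS = 8
OSDS_PER_NODE = 2

# Assumption: Ceph's default thresholds (mon_osd_nearfull_ratio and
# mon_osd_full_ratio); this cluster may have overridden them.
NEARFULL_RATIO = 0.85
FULL_RATIO = 0.95


def utilization_after_loss(total_gb, used_gb, num_osds, osds_lost):
    """Average utilization if the same data must fit on the surviving OSDs.

    Assumes all OSDs have the same capacity.
    """
    surviving_gb = total_gb * (num_osds - osds_lost) / num_osds
    return used_gb / surviving_gb


now = USED_GB / TOTAL_GB
after = utilization_after_loss(TOTAL_GB, USED_GB, NUM_OSDS, OSDS_PER_NODE)

print(f"average utilization now:            {now:.1%}")
print(f"average utilization minus one node: {after:.1%}")
print(f"near-full threshold:                {NEARFULL_RATIO:.0%}")
print(f"full threshold:                     {FULL_RATIO:.0%}")
```

With one node gone the average alone lands just under the near-full threshold, so with an uneven distribution some individual OSDs would be expected to sit well above it.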
> 
>> root@osd45:~# ceph -s
>>   cluster c8122868-27af-11e4-b570-52540004010f
>>    health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery 13086/1158268 degraded (1.130%)
>>    monmap e1: 3 mons at {osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0}, election epoch 80, quorum 0,1,2 osd42,osd43,osd45
>>    osdmap e723: 8 osds: 8 up, 8 in
>>     pgmap v543113: 640 pgs: 634 active+clean, 6 active+remapped+backfilling; 2222 GB data, 2239 GB used, 1295 GB / 3535 GB avail; 8268B/s wr, 0op/s; 13086/1158268 degraded (1.130%)
>>    mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby
>> 
> From what I've read in the past, the way forward here is to
> increase the full ratio setting so it can finish the recovery. Or
> add more OSDs, at least temporarily. See: 
> http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
> 
> Read that and apply that knowledge to your cluster. I personally
> wouldn't deploy it in this state.
> 
> Once the recovery is finished I'd proceed cautiously, see below.
> 
>> 
>> The sequence of events today that led to this was:
>> 
>> # starting state: pg_num/pgp_num == 64
>> ceph osd pool set rbd pg_num 128
>> ceph osd pool set rbd pgp_num 128
>> # there was a warning thrown up (which I've lost) and which left pgp_num == 64
>> # nothing happens since pgp_num was inadvertently not raised
>> ceph osd reweight-by-utilization
>> # data moves from one osd on a host to another osd on same host
>> ceph osd reweight 7 1
>> # data moves back to roughly what it had been
> Never mind the lack of PGs to play with, manually lowering the 
> weight of the fullest OSD (in small steps) at this time might have 
> given you at least a more level playing field.
> 
>> ceph osd pool set volumes pg_num 192
>> ceph osd pool set volumes pgp_num 192
>> # data moves successfully
> This would have been the time to check what actually happened and 
> if things improved or not (just adding PGs/PGPs might not be 
> enough) and again to manually reweight overly full OSDs.
> 
>> ceph osd pool set rbd pg_num 192
>> ceph osd pool set rbd pgp_num 192
>> # data stuck
>> 
> Baby steps. As in, applying the rise to 128 PGPs first. But I
> guess you would have run into the full OSD either way w/o
> reweighting things between steps.
> 
>> googling (nowadays known as research) reveals that these might be
>> helpful:
>> 
>> - ceph osd crush tunables optimal
> Yes, this might help. Not sure if that works with dumpling, but as 
> I already mentioned dumpling doesn't support "chooseleaf_vary_r". 
> Nor hashpspool. And while the data movement caused by this probably 
> will result in a better balanced cluster (again, with too few 
> PGs it will still do poorly), in the process of getting there it 
> might still run into a full OSD scenario.
> 
>> - setting crush weights to 1
>> 
> Dunno about that one. My crush weights were 1 when I deployed 
> things manually for the first time; for the 2nd manual deployment
> I used the size of the OSD, and ceph-deploy also uses the OSD size
> in TB.
> 
> Christian
> 
>> I resist doing anything for now in the hopes that someone has 
>> something coherent to say (Christian? ;-)
>> 
>> Thanks,
>> JR
>> 
>> 
>> On 9/8/2014 10:37 PM, JR wrote:
>>> Hi Christian,
>>> 
>>> Ha ...
>>> 
>>> root@osd45:~# ceph osd pool get rbd pg_num
>>> pg_num: 128
>>> root@osd45:~# ceph osd pool get rbd pgp_num
>>> pgp_num: 64
>>> 
>>> That's the explanation!  I did run the command but it spit out 
>>> some (what I thought was a harmless) warning; should have 
>>> checked more carefully.
>>> 
>>> I now have the expected data movement.
>>> 
>>> Thanks a lot!
>>> JR
>>> 
>>> On 9/8/2014 10:04 PM, Christian Balzer wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
>>>> 
>>>>> Hi Christian, all,
>>>>> 
>>>>> Having researched this a bit more, it seemed that just 
>>>>> doing
>>>>> 
>>>>> ceph osd pool set rbd pg_num 128
>>>>> ceph osd pool set rbd pgp_num 128
>>>>> 
>>>>> might be the answer.  Alas, it was not. After running the 
>>>>> above the cluster just sat there.
>>>>> 
>>>> Really now? No data movement, no health warnings during that 
>>>> in the logs, no other error in the logs or when issuing that 
>>>> command? Is it really at 128 now, verified with "ceph osd 
>>>> pool get rbd pg_num"?
>>>> 
>>>> You really want to get this addressed as per the previous 
>>>> reply before doing anything further. Because with just 64
>>>> PGs (as in only 8 per OSD!) massive imbalances are a given.
>>>> 
>>>>> Finally, reading some more, I ran:
>>>>> 
>>>>> ceph osd reweight-by-utilization
>>>>> 
>>>> Reading can be dangerous. ^o^
>>>> 
>>>> I didn't mention this, as it never worked for me in any 
>>>> predictable way and with a desirable outcome, especially in 
>>>> situations like yours.
>>>> 
>>>>> This accomplished moving the utilization of the first
>>>>> drive on the affected node to the 2nd drive! E.g.:
>>>>> 
>>>>> ------- BEFORE RUNNING: -------
>>>>> Filesystem     Use%
>>>>> /dev/sdc1     57%
>>>>> /dev/sdb1     65%
>>>>> Filesystem     Use%
>>>>> /dev/sdc1     90%
>>>>> /dev/sdb1     75%
>>>>> Filesystem     Use%
>>>>> /dev/sdb1     52%
>>>>> /dev/sdc1     52%
>>>>> Filesystem     Use%
>>>>> /dev/sdc1     54%
>>>>> /dev/sdb1     63%
>>>>> 
>>>>> ------- AFTER RUNNING: -------
>>>>> Filesystem     Use%
>>>>> /dev/sdc1     57%
>>>>> /dev/sdb1     65%
>>>>> Filesystem     Use%
>>>>> /dev/sdc1     70%          ** these two swapped (roughly) **
>>>>> /dev/sdb1     92%          ** ^^^^^ ^^^ ^^^^^^^ **
>>>>> Filesystem     Use%
>>>>> /dev/sdb1     52%
>>>>> /dev/sdc1     52%
>>>>> Filesystem     Use%
>>>>> /dev/sdc1     54%
>>>>> /dev/sdb1     63%
>>>>> 
>>>>> root@osd45:~# ceph osd tree
>>>>> # id    weight  type name       up/down reweight
>>>>> -1      3.44    root default
>>>>> -2      0.86            host osd45
>>>>> 0       0.43                    osd.0   up      1
>>>>> 4       0.43                    osd.4   up      1
>>>>> -3      0.86            host osd42
>>>>> 1       0.43                    osd.1   up      1
>>>>> 5       0.43                    osd.5   up      1
>>>>> -4      0.86            host osd44
>>>>> 2       0.43                    osd.2   up      1
>>>>> 6       0.43                    osd.6   up      1
>>>>> -5      0.86            host osd43
>>>>> 3       0.43                    osd.3   up      1
>>>>> 7       0.43                    osd.7   up      0.7007
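
To see why only osd.7 ended up with a reduced reweight value, here is a rough sketch of a utilization-based reweight. It is not the actual Ceph implementation: it assumes an overload threshold of 120% of the mean (the default argument of `ceph osd reweight-by-utilization`) and scales an overloaded OSD by mean/used. The per-OSD figures are the "Used" column of the `df -h` listing in the first message of this thread, and the function name is made up for illustration.

```python
# Used space in GB per OSD id, from the `df -h` listing in the first
# message of the thread. All OSDs are 442G, so used space is
# proportional to utilization.
used_gb = {5: 249, 1: 287, 7: 396, 3: 316, 2: 229, 6: 229, 4: 238, 0: 278}

# Assumption: overload threshold of 120% of the mean utilization.
OVERLOAD = 1.20


def suggest_reweights(used, overload=OVERLOAD):
    """Return {osd_id: suggested_reweight} for OSDs above the threshold.

    Illustrative only: each overloaded OSD is scaled by mean / used.
    """
    mean = sum(used.values()) / len(used)
    return {
        osd: round(mean / gb, 4)
        for osd, gb in used.items()
        if gb > mean * overload
    }


suggestions = suggest_reweights(used_gb)
print(suggestions)
```

Under these assumptions only osd.7 crosses the threshold, and the suggested value comes out close to the 0.7007 in the tree above. The sketch says nothing about where the displaced PGs go, which is the part that went wrong here.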
>>>>> 
>>>>> So this isn't the answer either.
>>>>> 
>>>> It might have been, if it had more PGs to distribute things 
>>>> along, see above. But even then with the default dumpling 
>>>> tunables it might not be much better.
>>>> 
>>>>> Could someone please chime in with an 
>>>>> explanation/suggestion?
>>>>> 
>>>>> I suspect that it might make sense to use 'ceph osd reweight 
>>>>> osd.7 1' and then run some form of 'ceph osd crush ...'?
>>>>> 
>>>> No need to crush anything. Reweight it to 1 after adding 
>>>> PGs/PGPs and, after all that data movement has finished,
>>>> slowly dial down any still overly utilized OSD.
>>>> 
>>>> Also per the "Uneven OSD usage" thread, you might run into a 
>>>> "full" situation during data re-distribution. Increase PGs
>>>> in small (64) increments.
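
The "small increments" advice can be turned into a step plan. This is a minimal sketch that only builds the command strings (the same `ceph osd pool set` commands used elsewhere in this thread) and runs nothing; the helper names are hypothetical.

```python
def pg_steps(current, target, increment=64):
    """Yield successive pg_num values from current up to target."""
    value = current
    while value < target:
        value = min(value + increment, target)
        yield value


def plan_commands(pool, current, target, increment=64):
    """Build the command pair for each step; pgp_num follows pg_num."""
    commands = []
    for n in pg_steps(current, target, increment):
        commands.append(f"ceph osd pool set {pool} pg_num {n}")
        commands.append(f"ceph osd pool set {pool} pgp_num {n}")
    return commands


for line in plan_commands("rbd", 64, 192):
    print(line)
```

Between steps one would wait for the data movement to finish, confirm with `ceph osd pool get rbd pgp_num` that the change actually took, and check OSD utilization before continuing.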
>>>> 
>>>>> Of course, I've read a number of things which suggest that 
>>>>> the two things I've done should have fixed my problem.
>>>>> 
>>>>> Is it (gasp!) possible that this, as Christian suggests,
>>>>> is a dumpling issue and, were I running on firefly, it
>>>>> would be sufficient?
>>>>> 
>>>> Running Firefly with all the tunables and probably 
>>>> hashpspool. Most of the tunables with the exception of 
>>>> "chooseleaf_vary_r" are available on dumpling, hashpspool 
>>>> isn't AFAIK. See 
>>>> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>>>> 
>>>> Christian
>>>>> 
>>>>> Thanks much
>>>>> JR
>>>>> 
>>>>> On 9/8/2014 1:50 PM, JR wrote:
>>>>>> Hi Christian,
>>>>>> 
>>>>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>>>> 
>>>>>> This seems borne out by:
>>>>>> 
>>>>>> root@osd45:~# rados lspools
>>>>>> data
>>>>>> metadata
>>>>>> rbd
>>>>>> volumes
>>>>>> images
>>>>>> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
>>>>>> data pg(pg_num: 64, pgppg_num: 64
>>>>>> metadata pg(pg_num: 64, pgppg_num: 64
>>>>>> rbd pg(pg_num: 64, pgppg_num: 64
>>>>>> volumes pg(pg_num: 128, pgppg_num: 128
>>>>>> images pg(pg_num: 128, pgppg_num: 128
>>>>>> 
>>>>>> According to the formula discussed in 'Uneven OSD usage':
>>>>>> 
>>>>>> "The formula is actually OSDs * 100 / replication"
>>>>>> 
>>>>>> in my case:
>>>>>> 
>>>>>> 8*100/2=400
>>>>>> 
>>>>>> So I'm erring on the large side?
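
The arithmetic above, written out as a small sketch. The per-pool values are the pg_num figures reported by the loop earlier in this message; the function name and the per-OSD breakdown are illustrative additions, and whether the rule of thumb applies per cluster or per pool is exactly the open question here.

```python
# Per-pool pg_num values as reported by the loop earlier in this message.
pools = {"data": 64, "metadata": 64, "rbd": 64, "volumes": 128, "images": 128}

NUM_OSDS = 8
REPLICAS = 2  # the replication factor used in the calculation above


def suggested_total_pgs(num_osds, replicas, per_osd=100):
    """Rule of thumb quoted in the thread: OSDs * 100 / replication."""
    return num_osds * per_osd // replicas


total = sum(pools.values())
target = suggested_total_pgs(NUM_OSDS, REPLICAS)
print(f"configured PGs: {total}, rule-of-thumb target: {target}")

# PGs of each pool per OSD, ignoring replicas.
for name, pg_num in pools.items():
    print(f"{name:<9} {pg_num:>4} PGs  ({pg_num / NUM_OSDS:.0f} per OSD)")
```

The cluster-wide total exceeds the target, but the total hides how few PGs each individual pool has to spread its data across the eight OSDs.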
>>>>>> 
>>>>>> Or does this formula apply on a per-pool basis?  Of my 5 
>>>>>> pools I'm using 3:
>>>>>> 
>>>>>> root@osd45:~# rados df|cut -c1-45
>>>>>> pool name       category                 KB
>>>>>> data            -                          0
>>>>>> images          -                          0
>>>>>> metadata        -                         10
>>>>>> rbd             -                  568489533
>>>>>> volumes         -                  594078601
>>>>>>   total used      2326235048       285923
>>>>>>   total avail     1380814968
>>>>>>   total space     3707050016
>>>>>> 
>>>>>> So should I up the number of PGs for the rbd and volumes 
>>>>>> pools?
>>>>>> 
>>>>>> I'll continue looking at docs, but for now I'll send
>>>>>> this off.
>>>>>> 
>>>>>> Thanks very much, Christian.
>>>>>> 
>>>>>> ps. This cluster is self-contained and all nodes in it 
>>>>>> are completely loaded (i.e., I can't add any more nodes 
>>>>>> nor disks). It's also not an option at the moment to 
>>>>>> upgrade to firefly (can't make a big change before 
>>>>>> sending it out the door).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>>>>>> 
>>>>>>>> Greetings all,
>>>>>>>> 
>>>>>>>> I have a small ceph cluster (4 nodes, 2 osds per 
>>>>>>>> node) which recently started showing:
>>>>>>>> 
>>>>>>>> root@osd45:~# ceph health
>>>>>>>> HEALTH_WARN 1 near full osd(s)
>>>>>>>> 
>>>>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep 'Filesystem|osd/ceph'; done
>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>> /dev/sdc1       442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>>>>>>>> /dev/sdb1       442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>> /dev/sdc1       442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>>>>>>>> /dev/sdb1       442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>> /dev/sdb1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>>>>>>>> /dev/sdc1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>> /dev/sdc1       442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>>>>>>>> /dev/sdb1       442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>>>>>>> 
>>>>>>>> 
>>>>>>> See the very recent "Uneven OSD usage" for a
>>>>>>> discussion about this. What are your PG/PGP values?
>>>>>>> 
>>>>>>>> This cluster has been running for weeks, under 
>>>>>>>> significant load, and has been 100% stable. 
>>>>>>>> Unfortunately we have to ship it out of the building 
>>>>>>>> to another part of our business (where we will have 
>>>>>>>> little access to it).
>>>>>>>> 
>>>>>>>> Based on what I've read about 'ceph osd reweight'
>>>>>>>> I'm a bit hesitant to just run it (I don't want to
>>>>>>>> do anything that impacts this cluster's stability).
>>>>>>>> 
>>>>>>>> Is there another, better way to equalize the 
>>>>>>>> distribution of the data on the osd partitions?
>>>>>>>> 
>>>>>>>> I'm running dumpling.
>>>>>>>> 
>>>>>>> As per the thread and my experience, Firefly would 
>>>>>>> solve this. If you can upgrade during a weekend or 
>>>>>>> whenever there is little to no access, do it.
>>>>>>> 
>>>>>>> Another option (of course any and all of these will 
>>>>>>> result in data movement, so pick an appropriate time) 
>>>>>>> would be to use "ceph osd reweight" to lower the
>>>>>>> weight of osd.7 in particular.
>>>>>>> 
>>>>>>> Lastly, given the utilization of your cluster, you 
>>>>>>> really ought to deploy more OSDs and/or more nodes; if 
>>>>>>> a node went down you'd easily get into a "real" 
>>>>>>> near full or full situation.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> Christian
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
> 

-- 
Your electronic communications are being monitored; strong encryption is
an answer. My public key
<http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x4F08C504BD634953>