Is ceph osd reweight always safe to use?

botemout@xxxxxxxxx (JR) · Tue, 09 Sep 2014 01:25:17 -0400

Greetings

After running for a couple of hours, my attempt to rebalance a near-full
disk has stopped with a stuck unclean error:

root@osd45:~# ceph -s
  cluster c8122868-27af-11e4-b570-52540004010f
   health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery
13086/1158268 degraded (1.130%)
   monmap e1: 3 mons at
{osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
election epoch 80, quorum 0,1,2 osd42,osd43,osd45
   osdmap e723: 8 osds: 8 up, 8 in
    pgmap v543113: 640 pgs: 634 active+clean, 6
active+remapped+backfilling; 2222 GB data, 2239 GB used, 1295 GB / 3535
GB avail; 8268B/s wr, 0op/s; 13086/1158268 degraded (1.130%)
   mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby

The sequence of events today that led to this were:

# starting state: pg_num/pgp_num == 64
ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128
# a warning was printed (which I've lost), and pgp_num stayed at 64
# nothing happened, since pgp_num was inadvertently not raised
ceph osd reweight-by-utilization
# data moves from one osd on a host to another osd on same host
ceph osd reweight 7 1
# data moves back to roughly what it had been
ceph osd pool set volumes pg_num 192
ceph osd pool set volumes pgp_num 192
# data moves successfully
ceph osd pool set rbd pg_num 192
ceph osd pool set rbd pgp_num 192
# data stuck
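
A minimal sketch of the check that would have caught the pg_num/pgp_num
mismatch in the sequence above. It only parses the "pg_num: N" style output
that "ceph osd pool get" prints in this thread; pg_check is a hypothetical
helper name, not a ceph command, and nothing here contacts a cluster.

```shell
# Hypothetical helper: compare the output of
#   ceph osd pool get <pool> pg_num
#   ceph osd pool get <pool> pgp_num
# Returns non-zero when the two values differ.
pg_check() {
    pg=${1#*: }     # "pg_num: 128" -> "128"
    pgp=${2#*: }    # "pgp_num: 64" -> "64"
    if [ "$pg" -eq "$pgp" ]; then
        echo "ok: pg_num=$pg pgp_num=$pgp"
    else
        echo "MISMATCH: pg_num=$pg pgp_num=$pgp"
        return 1
    fi
}

# The values reported for the rbd pool in the quoted reply below:
pg_check "pg_num: 128" "pgp_num: 64" || echo "raise pgp_num before expecting data to move"

# Against a live cluster this would be called as (not run here):
#   pg_check "$(ceph osd pool get rbd pg_num)" "$(ceph osd pool get rbd pgp_num)"
```

Running it after every "ceph osd pool set" pair makes a silently rejected
pgp_num change visible straight away.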

Googling (nowadays known as research) suggests that these might be helpful:

- ceph osd crush tunables optimal
- setting crush weights to 1

I resist doing anything for now in the hopes that someone has something
coherent to say (Christian? ;-)

Thanks
JR

On 9/8/2014 10:37 PM, JR wrote:
> Hi Christian,
> 
> Ha ...
> 
> root@osd45:~# ceph osd pool get rbd pg_num
> pg_num: 128
> root@osd45:~# ceph osd pool get rbd pgp_num
> pgp_num: 64
> 
> That's the explanation!  I did run the command, but it spat out a warning
> (which I thought was harmless); I should have checked more carefully.
> 
> I now have the expected data movement.
> 
> Thanks a lot!
> JR
> 
> On 9/8/2014 10:04 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
>>
>>> Hi Christian, all,
>>>
>>> Having researched this a bit more, it seemed that just doing
>>>
>>> ceph osd pool set rbd pg_num 128
>>> ceph osd pool set rbd pgp_num 128
>>>
>>> might be the answer.  Alas, it was not. After running the above the
>>> cluster just sat there.
>>>
>> Really now? No data movement, no health warnings during that in the logs,
>> no other error in the logs or when issuing that command?
>> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
>>
>> You really want to get this addressed as per the previous reply before
>> doing anything further. Because with just 64 PGs (as in only 8 per OSD!)
>> massive imbalances are a given.
>>
>>> Finally, reading some more, I ran:
>>>
>>>  ceph osd reweight-by-utilization
>>>
>> Reading can be dangerous. ^o^
>>
>> I didn't mention this, as it never worked for me in any predictable way
>> and with a desirable outcome, especially in situations like yours.
>>
>>> All this accomplished was moving the utilization of the first drive on
>>> the affected node to the 2nd drive, e.g.:
>>>
>>> -------
>>> BEFORE RUNNING:
>>> -------
>>> Filesystem     Use%
>>> /dev/sdc1     57%
>>> /dev/sdb1     65%
>>> Filesystem     Use%
>>> /dev/sdc1     90%
>>> /dev/sdb1     75%
>>> Filesystem     Use%
>>> /dev/sdb1     52%
>>> /dev/sdc1     52%
>>> Filesystem     Use%
>>> /dev/sdc1     54%
>>> /dev/sdb1     63%
>>>
>>> -------
>>> AFTER RUNNING:
>>> -------
>>> Filesystem     Use%
>>> /dev/sdc1     57%
>>> /dev/sdb1     65%
>>> Filesystem     Use%
>>> /dev/sdc1     70%          ** these two swapped (roughly) **
>>> /dev/sdb1     92%          ** ^^^^^ ^^^ ^^^^^^^           **
>>> Filesystem     Use%
>>> /dev/sdb1     52%
>>> /dev/sdc1     52%
>>> Filesystem     Use%
>>> /dev/sdc1     54%
>>> /dev/sdb1     63%
>>>
>>> root@osd45:~# ceph osd tree
>>> # id    weight  type name       up/down reweight
>>> -1      3.44    root default
>>> -2      0.86            host osd45
>>> 0       0.43                    osd.0   up      1
>>> 4       0.43                    osd.4   up      1
>>> -3      0.86            host osd42
>>> 1       0.43                    osd.1   up      1
>>> 5       0.43                    osd.5   up      1
>>> -4      0.86            host osd44
>>> 2       0.43                    osd.2   up      1
>>> 6       0.43                    osd.6   up      1
>>> -5      0.86            host osd43
>>> 3       0.43                    osd.3   up      1
>>> 7       0.43                    osd.7   up      0.7007
>>>
>>> So this isn't the answer either.
>>>
>> It might have been, if it had more PGs to distribute things along, see
>> above. But even then with the default dumpling tunables it might not be
>> much better.
>>
>>> Could someone please chime in with an explanation/suggestion?
>>>
>>> I suspect that it might make sense to use 'ceph osd reweight osd.7 1' and
>>> then run some form of 'ceph osd crush ...'?
>>>
>> No need to crush anything. Reweight it to 1 after adding PGs/PGPs and,
>> once all that data movement has finished, slowly dial down any OSD that
>> is still overly utilized.
>>
>> Also per the "Uneven OSD usage" thread, you might run into a "full"
>> situation during data re-distribution. Increase PGs in small (64)
>> increments.
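
A dry-run sketch of the "small (64) increments" advice above. pg_plan is a
hypothetical helper that only prints the "ceph osd pool set" commands already
used in this thread; it does not run them, and it assumes you let data
movement finish (and confirm both values with "ceph osd pool get") between
steps.

```shell
# Hypothetical dry-run helper: print the commands needed to raise a pool
# from its current PG count to a target, one step at a time.
# Usage: pg_plan <pool> <current> <target> [step, default 64]
pg_plan() {
    pool=$1
    cur=$2
    target=$3
    step=${4:-64}
    while [ "$cur" -lt "$target" ]; do
        cur=$(( cur + step ))
        # Do not overshoot the target on the last step.
        if [ "$cur" -gt "$target" ]; then
            cur=$target
        fi
        echo "ceph osd pool set $pool pg_num $cur"
        echo "ceph osd pool set $pool pgp_num $cur"
    done
}

pg_plan rbd 64 192
# prints:
#   ceph osd pool set rbd pg_num 128
#   ceph osd pool set rbd pgp_num 128
#   ceph osd pool set rbd pg_num 192
#   ceph osd pool set rbd pgp_num 192
```
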
>>
>>> Of course, I've read a number of things which suggest that the two
>>> things I've done should have fixed my problem.
>>>
>>> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
>>> issue and, were I running on firefly, it would be sufficient?
>>>
>> Running Firefly with all the tunables and probably hashpspool. 
>> Most of the tunables with the exception of "chooseleaf_vary_r" are
>> available on dumpling, hashpspool isn't AFAIK.
>> See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>>
>> Christian
>>>
>>> Thanks much
>>> JR
>>> On 9/8/2014 1:50 PM, JR wrote:
>>>> Hi Christian,
>>>>
>>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>>
>>>> This seems borne out by:
>>>>
>>>> root@osd45:~# rados lspools
>>>> data
>>>> metadata
>>>> rbd
>>>> volumes
>>>> images
>>>> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd pool
>>>> get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
>>>> data pg(pg_num: 64, pgppg_num: 64
>>>> metadata pg(pg_num: 64, pgppg_num: 64
>>>> rbd pg(pg_num: 64, pgppg_num: 64
>>>> volumes pg(pg_num: 128, pgppg_num: 128
>>>> images pg(pg_num: 128, pgppg_num: 128
>>>>
>>>> According to the formula discussed in 'Uneven OSD usage,'
>>>>
>>>> "The formula is actually OSDs * 100 / replication"
>>>>
>>>> in my case:
>>>>
>>>> 8*100/2=400
>>>>
>>>> So I'm erring on the large side?
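
The arithmetic behind the two figures above, as a small sketch. The formula
and the replication factor of 2 are taken from this message; pg_target is a
hypothetical helper name, and whether the result applies per pool or to the
whole cluster is exactly the open question being asked here.

```shell
# Rule of thumb quoted above: PGs ~ OSDs * 100 / replication.
# Usage: pg_target <osds> <replicas>
pg_target() {
    osds=$1
    replicas=$2
    echo $(( osds * 100 / replicas ))
}

pg_target 8 2                          # prints 400
echo $(( 64 + 64 + 64 + 128 + 128 ))   # prints 448, the sum of the pg_num values listed above
```
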
>>>>
>>>> Or, does this formula apply on a per-pool basis?  Of my 5 pools I'm using
>>>> 3:
>>>>
>>>> root@osd45:~# rados df|cut -c1-45
>>>> pool name       category                 KB
>>>> data            -                          0
>>>> images          -                          0
>>>> metadata        -                         10
>>>> rbd             -                  568489533
>>>> volumes         -                  594078601
>>>>   total used      2326235048       285923
>>>>   total avail     1380814968
>>>>   total space     3707050016
>>>>
>>>> So should I up the number of PGs for the rbd and volumes pools?
>>>>
>>>> I'll continue looking at docs, but for now I'll send this off.
>>>>
>>>> Thanks very much, Christian.
>>>>
>>>> ps. This cluster is self-contained and all nodes in it are completely
>>>> loaded (i.e., I can't add any more nodes nor disks).  It's also not an
>>>> option at the moment to upgrade to firefly (can't make a big change
>>>> before sending it out the door).
>>>>
>>>>
>>>>
>>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>>>>
>>>>>> Greetings all,
>>>>>>
>>>>>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
>>>>>> started showing:
>>>>>>
>>>>>> root@ocd45:~# ceph health
>>>>>> HEALTH_WARN 1 near full osd(s)
>>>>>>
>>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
>>>>>> 'Filesystem|osd/ceph'; done
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdc1       442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>>>>>> /dev/sdb1       442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdc1       442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>>>>>> /dev/sdb1       442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdb1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>>>>>> /dev/sdc1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> /dev/sdc1       442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>>>>>> /dev/sdb1       442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>>>>>
>>>>>>
>>>>> See the very recent "Uneven OSD usage" for a discussion about this.
>>>>> What are your PG/PGP values?
>>>>>
>>>>>> This cluster has been running for weeks, under significant load, and
>>>>>> has been 100% stable. Unfortunately we have to ship it out of the
>>>>>> building to another part of our business (where we will have little
>>>>>> access to it).
>>>>>>
>>>>>> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant
>>>>>> to just run it (I don't want to do anything that impacts this
>>>>>> cluster's stability).
>>>>>>
>>>>>> Is there another, better way to equalize the distribution of the data
>>>>>> on the osd partitions?
>>>>>>
>>>>>> I'm running dumpling.
>>>>>>
>>>>> As per the thread and my experience, Firefly would solve this. If you
>>>>> can upgrade during a weekend or whenever there is little to no
>>>>> access, do it.
>>>>>
>>>>> Another option (of course any and all of these will result in data
>>>>> movement, so pick an appropriate time) would be to use "ceph osd
>>>>> reweight" to lower the weight of osd.7 in particular.
>>>>>
>>>>> Lastly, given the utilization of your cluster, you really ought to
>>>>> deploy more OSDs and/or more nodes; if a node went down you'd
>>>>> easily get into a "real" near full or full situation.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Christian
>>>>>
>>>>
>>>
>>
>>
> 
