Is ceph osd reweight always safe to use?

Hi Christian, all,

Having researched this a bit more, it seemed that just doing

ceph osd pool set rbd pg_num 128
ceph osd pool set rbd pgp_num 128

might be the answer.  Alas, it was not: after running the above, the
cluster just sat there.
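
For reference, this is roughly how one can confirm the new values took
effect and watch for any resulting data movement; these are standard
commands, nothing here is specific to this cluster:

 ceph osd pool get rbd pg_num     # should report the new pg_num
 ceph osd pool get rbd pgp_num    # pgp_num must match for data to actually move
 ceph -s                          # look for remapped/backfilling PGs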

Finally, reading some more, I ran:

 ceph osd reweight-by-utilization

All this accomplished was moving the excess utilization from the first
drive on the affected node to the second drive, e.g.:

-------
BEFORE RUNNING:
-------
Filesystem     Use%
/dev/sdc1     57%
/dev/sdb1     65%
Filesystem     Use%
/dev/sdc1     90%
/dev/sdb1     75%
Filesystem     Use%
/dev/sdb1     52%
/dev/sdc1     52%
Filesystem     Use%
/dev/sdc1     54%
/dev/sdb1     63%

-------
AFTER RUNNING:
-------
Filesystem     Use%
/dev/sdc1     57%
/dev/sdb1     65%
Filesystem     Use%
/dev/sdc1     70%          ** these two swapped (roughly) **
/dev/sdb1     92%          ** ^^^^^ ^^^ ^^^^^^^           **
Filesystem     Use%
/dev/sdb1     52%
/dev/sdc1     52%
Filesystem     Use%
/dev/sdc1     54%
/dev/sdb1     63%

root@osd45:~# ceph osd tree
# id    weight  type name       up/down reweight
-1      3.44    root default
-2      0.86            host osd45
0       0.43                    osd.0   up      1
4       0.43                    osd.4   up      1
-3      0.86            host osd42
1       0.43                    osd.1   up      1
5       0.43                    osd.5   up      1
-4      0.86            host osd44
2       0.43                    osd.2   up      1
6       0.43                    osd.6   up      1
-5      0.86            host osd43
3       0.43                    osd.3   up      1
7       0.43                    osd.7   up      0.7007

So this isn't the answer either.

Could someone please chime in with an explanation/suggestion?

I suspect it might make sense to run 'ceph osd reweight osd.7 1' and
then some form of 'ceph osd crush ...'?
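
Concretely, I imagine something along these lines; the CRUSH weight value
below is only a guess on my part, meant to push data off osd.7, not a
tested recipe:

 ceph osd reweight osd.7 1            # clear the 0.7007 override set above
 ceph osd crush reweight osd.7 0.40   # hypothetical: a bit below the current 0.43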

Of course, I've read a number of things which suggest that the two
things I've done should have fixed my problem.

Is it (gasp!) possible that this, as Christian suggests, is a dumpling
issue and that, were I running firefly, what I've done would have sufficed?


Thanks much
JR
On 9/8/2014 1:50 PM, JR wrote:
> Hi Christian,
> 
> I have 448 PGs and 448 PGPs (according to ceph -s).
> 
> This seems borne out by:
> 
> root@osd45:~# rados lspools
> data
> metadata
> rbd
> volumes
> images
> root@osd45:~# for i in $(rados lspools); do echo "$i: $(ceph osd pool get $i pg_num), $(ceph osd pool get $i pgp_num)"; done
> data: pg_num: 64, pgp_num: 64
> metadata: pg_num: 64, pgp_num: 64
> rbd: pg_num: 64, pgp_num: 64
> volumes: pg_num: 128, pgp_num: 128
> images: pg_num: 128, pgp_num: 128
> 
> According to the formula discussed in 'Uneven OSD usage,'
> 
> "The formula is actually OSDs * 100 / replication"
> 
> in my case:
> 
> 8*100/2=400
> 
> So I'm erring on the large side?
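> 
> A quick sanity check on the arithmetic (the per-pool pg_num values are
> the ones from the loop above, replication factor 2):
> 
>  echo $(( 8 * 100 / 2 ))          # 400, the suggested total
>  echo $(( 64+64+64+128+128 ))     # 448, what the cluster currently has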
> 
> Or, does this formula apply on a per-pool basis?  Of my 5 pools I'm using 3:
> 
> root@osd45:~# rados df|cut -c1-45
> pool name       category                 KB
> data            -                          0
> images          -                          0
> metadata        -                         10
> rbd             -                  568489533
> volumes         -                  594078601
>   total used      2326235048       285923
>   total avail     1380814968
>   total space     3707050016
> 
> So should I up the number of PGs for the rbd and volumes pools?
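> 
> If so, I assume it would be the usual two-step per pool, e.g. for rbd
> (128 here is just my guess at a next step, not a value anyone has
> recommended):
> 
> ceph osd pool set rbd pg_num 128
> ceph osd pool set rbd pgp_num 128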
> 
> I'll continue looking at docs, but for now I'll send this off.
> 
> Thanks very much, Christian.
> 
> PS: This cluster is self-contained and all nodes in it are completely
> loaded (i.e., I can't add any more nodes or disks).  It's also not an
> option at the moment to upgrade to firefly (can't make a big change
> before sending it out the door).
> 
> 
> 
> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>
>>> Greetings all,
>>>
>>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
>>> started showing:
>>>
>>> root@osd45:~# ceph health
>>> HEALTH_WARN 1 near full osd(s)
>>>
>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h | egrep 'Filesystem|osd/ceph'; done
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sdc1       442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>>> /dev/sdb1       442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sdc1       442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>>> /dev/sdb1       442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sdb1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>>> /dev/sdc1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sdc1       442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>>> /dev/sdb1       442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>>
>>>
>> See the very recent "Uneven OSD usage" for a discussion about this.
>> What are your PG/PGP values?
>>
>>> This cluster has been running for weeks, under significant load, and has
>>> been 100% stable. Unfortunately we have to ship it out of the building
>>> to another part of our business (where we will have little access to it).
>>>
>>> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant to
>>> just run it (I don't want to do anything that impacts this cluster's
>>> stability).
>>>
>>> Is there another, better way to equalize the distribution of the data
>>> on the OSD partitions?
>>>
>>> I'm running dumpling.
>>>
>> As per the thread and my experience, Firefly would solve this. If you can
>> upgrade during a weekend or whenever there is little to no access, do it.
>>
>> Another option (of course any and all of these will result in data
>> movement, so pick an appropriate time) would be to use "ceph osd
>> reweight" to lower the weight of osd.7 in particular.
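>>
>> For example, something like the following (the 0.85 is only
>> illustrative, pick whatever value moves enough data off that OSD):
>>
>>  ceph osd reweight osd.7 0.85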
>>
>> Lastly, given the utilization of your cluster, you really ought to deploy
>> more OSDs and/or more nodes; if a node were to go down you'd easily get
>> into a "real" near full or full situation.
>>
>> Regards,
>>
>> Christian
>>
> 

-- 
Your electronic communications are being monitored; strong encryption is
an answer. My public key
<http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x4F08C504BD634953>

