Is ceph osd reweight always safe to use?

Hi Christian,

Ha ...

root@osd45:~# ceph osd pool get rbd pg_num
pg_num: 128
root@osd45:~# ceph osd pool get rbd pgp_num
pgp_num: 64

That's the explanation!  I did run the command, but it spat out a warning
(which I thought was harmless); I should have checked more carefully.
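
For the archives, the fix was simply to re-run the second half and check
it this time:

ceph osd pool set rbd pgp_num 128
ceph osd pool get rbd pgp_num    # verify: pgp_num: 128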

I now have the expected data movement.

Thanks a lot!
JR

On 9/8/2014 10:04 PM, Christian Balzer wrote:
> 
> Hello,
> 
> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
> 
>> Hi Christian, all,
>>
>> Having researched this a bit more, it seemed that just doing
>>
>> ceph osd pool set rbd pg_num 128
>> ceph osd pool set rbd pgp_num 128
>>
>> might be the answer.  Alas, it was not. After running the above the
>> cluster just sat there.
>>
> Really now? No data movement, no health warnings in the logs while it
> ran, no other errors in the logs or when issuing those commands?
> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
> 
> You really want to get this addressed as per the previous reply before
> doing anything further. Because with just 64 PGs (as in only 8 per OSD!)
> massive imbalances are a given.
> 
>> Finally, reading some more, I ran:
>>
>>  ceph osd reweight-by-utilization
>>
> Reading can be dangerous. ^o^
> 
> I didn't mention this, as it never worked for me in any predictable way
> and with a desirable outcome, especially in situations like yours.
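> 
> For what it's worth, it takes an optional utilization threshold
> (default 120, i.e. only OSDs above 120% of the average utilization get
> reweighted down; syntax from memory, so double-check), so a gentler run
> would be something like:
> 
> ceph osd reweight-by-utilization 140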
> 
>> This accomplished moving the utilization from the first drive on the
>> affected node to the 2nd drive! E.g.:
>>
>> -------
>> BEFORE RUNNING:
>> -------
>> Filesystem     Use%
>> /dev/sdc1     57%
>> /dev/sdb1     65%
>> Filesystem     Use%
>> /dev/sdc1     90%
>> /dev/sdb1     75%
>> Filesystem     Use%
>> /dev/sdb1     52%
>> /dev/sdc1     52%
>> Filesystem     Use%
>> /dev/sdc1     54%
>> /dev/sdb1     63%
>>
>> -------
>> AFTER RUNNING:
>> -------
>> Filesystem     Use%
>> /dev/sdc1     57%
>> /dev/sdb1     65%
>> Filesystem     Use%
>> /dev/sdc1     70%          ** these two swapped (roughly) **
>> /dev/sdb1     92%          ** ^^^^^ ^^^ ^^^^^^^           **
>> Filesystem     Use%
>> /dev/sdb1     52%
>> /dev/sdc1     52%
>> Filesystem     Use%
>> /dev/sdc1     54%
>> /dev/sdb1     63%
>>
>> root@osd45:~# ceph osd tree
>> # id    weight  type name       up/down reweight
>> -1      3.44    root default
>> -2      0.86            host osd45
>> 0       0.43                    osd.0   up      1
>> 4       0.43                    osd.4   up      1
>> -3      0.86            host osd42
>> 1       0.43                    osd.1   up      1
>> 5       0.43                    osd.5   up      1
>> -4      0.86            host osd44
>> 2       0.43                    osd.2   up      1
>> 6       0.43                    osd.6   up      1
>> -5      0.86            host osd43
>> 3       0.43                    osd.3   up      1
>> 7       0.43                    osd.7   up      0.7007
>>
>> So this isn't the answer either.
>>
> It might have been, if there had been more PGs along which to
> distribute things, see above. But even then, with the default dumpling
> tunables, it might not be much better.
> 
>> Could someone please chime in with an explanation/suggestion?
>>
>> I suspect it might make sense to use 'ceph osd reweight osd.7 1' and
>> then run some form of 'ceph osd crush ...'?
>>
> No need to crush anything; reweight it to 1 after adding PGs/PGPs, and
> after all that data movement has finished, slowly dial down any still
> overly utilized OSD.
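> 
> Concretely, something along these lines:
> 
> ceph osd reweight 7 1
> # wait for the rebalance to finish (watch "ceph -s"); then, if e.g.
> # osd.7 is still the fullest, nudge it back down gently:
> ceph osd reweight 7 0.9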
> 
> Also per the "Uneven OSD usage" thread, you might run into a "full"
> situation during data re-distribution. Increase PGs in small (64)
> increments.
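> 
> I.e. rather than jumping straight to the final value, something like:
> 
> ceph osd pool set rbd pg_num 128
> ceph osd pool set rbd pgp_num 128
> # wait for HEALTH_OK and re-check utilization before going on to 192,
> # 256, and so forth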
> 
>> Of course, I've read a number of things which suggest that the two
>> things I've done should have fixed my problem.
>>
>> Is it (gasp!) possible that this, as Christian suggests, is a dumpling
>> issue, and that, were I running firefly, what I've done would have
>> sufficed?
>>
> Running Firefly with all the tunables and probably hashpspool should be.
> Most of the tunables, with the exception of "chooseleaf_vary_r", are
> available on dumpling; hashpspool isn't, AFAIK.
> See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
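> 
> If you do end up on Firefly, both are quick to enable; mind that each
> change triggers data movement (syntax from memory, so verify first):
> 
> ceph osd crush tunables optimal
> ceph osd pool set rbd hashpspool true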
> 
> Christian
>>
>> Thanks much
>> JR
>> On 9/8/2014 1:50 PM, JR wrote:
>>> Hi Christian,
>>>
>>> I have 448 PGs and 448 PGPs (according to ceph -s).
>>>
>>> This seems borne out by:
>>>
>>> root@osd45:~# rados lspools
>>> data
>>> metadata
>>> rbd
>>> volumes
>>> images
>>> root@osd45:~# for i in $(rados lspools); do echo "$i: $(ceph osd pool
>>> get $i pg_num), $(ceph osd pool get $i pgp_num)"; done
>>> data: pg_num: 64, pgp_num: 64
>>> metadata: pg_num: 64, pgp_num: 64
>>> rbd: pg_num: 64, pgp_num: 64
>>> volumes: pg_num: 128, pgp_num: 128
>>> images: pg_num: 128, pgp_num: 128
>>>
>>> According to the formula discussed in 'Uneven OSD usage,'
>>>
>>> "The formula is actually OSDs * 100 / replication
>>>
>>> in my case:
>>>
>>> 8*100/2=400
>>>
>>> So I'm erroring on the large size?
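>>>
>>> Sanity-checking that arithmetic in the shell:
>>>
>>> echo $(( 8 * 100 / 2 ))    # 400, vs. the 448 PGs I actually have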
>>>
>>> Or does this formula apply on a per-pool basis? Of my 5 pools I'm
>>> using 3:
>>>
>>> root@osd45:~# rados df|cut -c1-45
>>> pool name       category                 KB
>>> data            -                          0
>>> images          -                          0
>>> metadata        -                         10
>>> rbd             -                  568489533
>>> volumes         -                  594078601
>>>   total used      2326235048       285923
>>>   total avail     1380814968
>>>   total space     3707050016
>>>
>>> So should I up the number of PGs for the rbd and volumes pools?
>>>
>>> I'll continue looking at docs, but for now I'll send this off.
>>>
>>> Thanks very much, Christian.
>>>
>>> PS. This cluster is self-contained and all its nodes are fully
>>> populated (i.e., I can't add any more nodes or disks). It's also not
>>> an option at the moment to upgrade to firefly (I can't make a big
>>> change before sending it out the door).
>>>
>>>
>>>
>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
>>>>
>>>>> Greetings all,
>>>>>
>>>>> I have a small ceph cluster (4 nodes, 2 osds per node) which recently
>>>>> started showing:
>>>>>
>>>>> root@osd45:~# ceph health
>>>>> HEALTH_WARN 1 near full osd(s)
>>>>>
>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
>>>>> 'Filesystem|osd/ceph'; done
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/sdc1       442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
>>>>> /dev/sdb1       442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/sdc1       442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
>>>>> /dev/sdb1       442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/sdb1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
>>>>> /dev/sdc1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/sdc1       442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
>>>>> /dev/sdb1       442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
>>>>>
>>>>>
>>>> See the very recent "Uneven OSD usage" for a discussion about this.
>>>> What are your PG/PGP values?
>>>>
>>>>> This cluster has been running for weeks, under significant load, and
>>>>> has been 100% stable. Unfortunately we have to ship it out of the
>>>>> building to another part of our business (where we will have little
>>>>> access to it).
>>>>>
>>>>> Based on what I've read about 'ceph osd reweight' I'm a bit hesitant
>>>>> to just run it (I don't want to do anything that impacts this
>>>>> cluster's stability).
>>>>>
>>>>> Is there another, better way to equalize the distribution of data
>>>>> across the OSD partitions?
>>>>>
>>>>> I'm running dumpling.
>>>>>
>>>> As per the thread and my experience, Firefly would solve this. If you
>>>> can upgrade during a weekend or whenever there is little to no
>>>> access, do it.
>>>>
>>>> Another option (of course any and all of these will result in data
>>>> movement, so pick an appropriate time) would be to use "ceph osd
>>>> reweight" to lower the weight of osd.7 in particular.
>>>>
>>>> Lastly, given the utilization of your cluster, you really ought to
>>>> deploy more OSDs and/or more nodes; if a node went down, you'd
>>>> easily end up in a "real" near full or full situation.
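>>>>
>>>> Should re-distribution ever push an OSD towards "full", the ratios
>>>> can be raised temporarily as an emergency valve (from memory, so
>>>> double-check the syntax):
>>>>
>>>> ceph pg set_nearfull_ratio 0.90
>>>> ceph pg set_full_ratio 0.98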
>>>>
>>>> Regards,
>>>>
>>>> Christian
>>>>
>>>
>>
> 
> 

-- 
Your electronic communications are being monitored; strong encryption is
an answer. My public key
<http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x4F08C504BD634953>

