Hello,

On Mon, 08 Sep 2014 13:50:08 -0400 JR wrote:

> Hi Christian,
>
> I have 448 PGs and 448 PGPs (according to ceph -s).
>
> This seems borne out by:
>
> root@osd45:~# rados lspools
> data
> metadata
> rbd
> volumes
> images
> root@osd45:~# for i in $(rados lspools); do echo "$i: $(ceph osd pool
> get $i pg_num), $(ceph osd pool get $i pgp_num)"; done
> data: pg_num: 64, pgp_num: 64
> metadata: pg_num: 64, pgp_num: 64
> rbd: pg_num: 64, pgp_num: 64
> volumes: pg_num: 128, pgp_num: 128
> images: pg_num: 128, pgp_num: 128
>
> According to the formula discussed in 'Uneven OSD usage':
>
> "The formula is actually OSDs * 100 / replication"
>
> In my case:
>
> 8 * 100 / 2 = 400
>
> So I'm erring on the large side?
>
No, because for starters the documentation (and, if I recall correctly,
that thread as well) suggests rounding up to the nearest power of 2.
That would be 512 in your case, and with really small clusters
overprovisioning PGs/PGPs makes a lot of sense.

> Or does this formula apply on a per-pool basis? Of my 5 pools I'm
> using 3:
>
Not strictly, but clearly only the pools that are actually in use will
have any impact and benefit from this.

> root@nebula45:~# rados df | cut -c1-45
> pool name       category                 KB
> data            -                         0
> images          -                         0
> metadata        -                        10
> rbd             -                 568489533
> volumes         -                 594078601
>   total used      2326235048       285923
>   total avail     1380814968
>   total space     3707050016
>
> So should I up the number of PGs for the rbd and volumes pools?
>
Definitely. 256 at least, but I personally would even go for 512,
especially with dumpling.
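
Off the top of my head (so do verify against the docs for your version
before running anything), bumping both pools to 256 would look
something like the lines below. Note that pg_num can only ever be
increased, never decreased, and both steps trigger data movement, so
pick a quiet time:

    # raise pg_num first, then bring pgp_num up to match so the new
    # placement groups actually get rebalanced across the OSDs:
    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256
    ceph osd pool set volumes pg_num 256
    ceph osd pool set volumes pgp_num 256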

Christian

> I'll continue looking at docs, but for now I'll send this off.
>
> Thanks very much, Christian.
>
> P.S. This cluster is self-contained and all nodes in it are completely
> loaded (i.e., I can't add any more nodes or disks). It's also not an
> option at the moment to upgrade to firefly (can't make a big change
> before sending it out the door).
>
>
> On 9/8/2014 12:09 PM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
> >
> >> Greetings all,
> >>
> >> I have a small ceph cluster (4 nodes, 2 OSDs per node) which
> >> recently started showing:
> >>
> >> root@osd45:~# ceph health
> >> HEALTH_WARN 1 near full osd(s)
> >>
> >> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h | egrep
> >> 'Filesystem|osd/ceph'; done
> >> Filesystem      Size  Used Avail Use% Mounted on
> >> /dev/sdc1       442G  249G  194G  57% /var/lib/ceph/osd/ceph-5
> >> /dev/sdb1       442G  287G  156G  65% /var/lib/ceph/osd/ceph-1
> >> Filesystem      Size  Used Avail Use% Mounted on
> >> /dev/sdc1       442G  396G   47G  90% /var/lib/ceph/osd/ceph-7
> >> /dev/sdb1       442G  316G  127G  72% /var/lib/ceph/osd/ceph-3
> >> Filesystem      Size  Used Avail Use% Mounted on
> >> /dev/sdb1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-2
> >> /dev/sdc1       442G  229G  214G  52% /var/lib/ceph/osd/ceph-6
> >> Filesystem      Size  Used Avail Use% Mounted on
> >> /dev/sdc1       442G  238G  205G  54% /var/lib/ceph/osd/ceph-4
> >> /dev/sdb1       442G  278G  165G  63% /var/lib/ceph/osd/ceph-0
> >>
> >>
> > See the very recent "Uneven OSD usage" thread for a discussion of
> > this. What are your PG/PGP values?
> >
> >> This cluster has been running for weeks, under significant load,
> >> and has been 100% stable. Unfortunately we have to ship it out of
> >> the building to another part of our business (where we will have
> >> little access to it).
> >>
> >> Based on what I've read about 'ceph osd reweight' I'm a bit
> >> hesitant to just run it (I don't want to do anything that impacts
> >> this cluster's stability).
> >>
> >> Is there another, better way to equalize the distribution of the
> >> data across the OSD partitions?
> >>
> >> I'm running dumpling.
> >>
> > As per that thread and my experience, Firefly would solve this. If
> > you can upgrade during a weekend or whenever there is little to no
> > access, do it.
> >
> > Another option (of course any and all of these will result in data
> > movement, so pick an appropriate time) would be to use "ceph osd
> > reweight" to lower the weight of osd.7 in particular (a sketch
> > follows below the signature).
> >
> > Lastly, given the utilization of your cluster, you really ought to
> > deploy more OSDs and/or more nodes; if a node went down you'd easily
> > get into a "real" near-full or full situation.
> >
> > Regards,
> >
> > Christian
> >

-- 
Christian Balzer        Network/Systems Engineer
chibi@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
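
A minimal sketch of the reweight approach mentioned above, assuming
osd.7 (the 90% full OSD in the df output) is the target; the 0.85
value is purely illustrative, not a recommendation:

    # "ceph osd reweight" takes an OSD id and a relative weight between
    # 0 and 1; lowering it moves some of that OSD's PGs elsewhere.
    # Start conservatively and re-check before going lower:
    ceph osd reweight 7 0.85
    # watch for the near-full warning to clear, and confirm the new
    # reweight value in the tree output:
    ceph health
    ceph osd tree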