Re: Changing pg_num => RBD VM down !

@Michael Kuriger: when ceph/librbd is operating normally, I know that doubling the pg_num is the safe way. But when it has a problem, I think doubling it could make many, many VMs die (maybe >= 50%?).


On Mon, Mar 16, 2015 at 9:53 PM, Michael Kuriger <mk7193@xxxxxx> wrote:
I always keep my pg number a power of 2.  So I’d go from 2048 to 4096.  I’m not sure if this is the safest way, but it’s worked for me.
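
For reference, the change itself is just two pool settings; a minimal sketch (the pool name is a placeholder, and pgp_num has to follow pg_num before the data actually rebalances):

    ceph osd pool get <pool> pg_num          # check the current value
    ceph osd pool set <pool> pg_num 4096     # double the placement group count
    ceph osd pool set <pool> pgp_num 4096    # allow the new PGs to be placed/rebalanced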

 

yp

 

Michael Kuriger

Sr. Unix Systems Engineer

mk7193@xxxxxx | 818-649-7235


From: Chu Duc Minh <chu.ducminh@xxxxxxxxx>
Date: Monday, March 16, 2015 at 7:49 AM
To: Florent B <florent@xxxxxxxxxxx>
Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: Changing pg_num => RBD VM down !

I'm using the latest Giant and have the same issue. When I increase the pg_num of a pool from 2048 to 2148, my VMs are still OK. When I increase it from 2148 to 2400, some VMs die (the qemu-kvm processes die).
My physical servers (which host the VMs) run kernel 3.13 and use librbd.
I think it's a bug in librbd related to the crushmap.
(I set crush_tunables3 on my Ceph cluster; could that be related?)
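
For comparison, the tunables actually in effect can be dumped with the stock command (nothing cluster-specific assumed here):

    ceph osd crush show-tunables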

Do you know a way to safely increase pg_num? (I don't think increasing pg_num by 100 each time is a safe & good way.)
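
For context, the steps I'm doing now look roughly like this (the pool name is a placeholder, and I'm assuming pgp_num should be bumped alongside pg_num):

    ceph osd pool set <pool> pg_num 2148     # small increase over the previous 2048
    # wait for 'ceph -s' to show all PGs active+clean again
    ceph osd pool set <pool> pgp_num 2148
    # ...then repeat with the next small increment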

Regards,

On Mon, Mar 16, 2015 at 8:50 PM, Florent B <florent@xxxxxxxxxxx> wrote:
We are on Giant.

On 03/16/2015 02:03 PM, Azad Aliyar wrote:
>
> May I know your Ceph version? The latest version of Firefly, 0.80.9, has
> patches to avoid excessive data migration during reweighting of OSDs. You
> may need to set a tunable in order to make this patch active.
>
> This is a bugfix release for firefly.  It fixes a performance regression
> in librbd, an important CRUSH misbehavior (see below), and several RGW
> bugs.  We have also backported support for flock/fcntl locks to ceph-fuse
> and libcephfs.
>
> We recommend that all Firefly users upgrade.
>
> For more detailed information, see
>   http://docs.ceph.com/docs/master/_downloads/v0.80.9.txt
>
> Adjusting CRUSH maps
> --------------------
>
> * This point release fixes several issues with CRUSH that trigger
>   excessive data migration when adjusting OSD weights.  These are most
>   obvious when a very small weight change (e.g., a change from 0 to
>   .01) triggers a large amount of movement, but the same set of bugs
>   can also lead to excessive (though less noticeable) movement in
>   other cases.
>
>   However, because the bug may already have affected your cluster,
>   fixing it may trigger movement *back* to the more correct location.
>   For this reason, you must manually opt-in to the fixed behavior.
>
>   In order to set the new tunable to correct the behavior::
>
>      ceph osd crush set-tunable straw_calc_version 1
>
>   Note that this change will have no immediate effect.  However, from
>   this point forward, any 'straw' bucket in your CRUSH map that is
>   adjusted will get non-buggy internal weights, and that transition
>   may trigger some rebalancing.
>
>   You can estimate how much rebalancing will eventually be necessary
>   on your cluster with::
>
>      ceph osd getcrushmap -o /tmp/cm
>      crushtool -i /tmp/cm --num-rep 3 --test --show-mappings > /tmp/a 2>&1
>      crushtool -i /tmp/cm --set-straw-calc-version 1 -o /tmp/cm2
>      crushtool -i /tmp/cm2 --reweight -o /tmp/cm2
>      crushtool -i /tmp/cm2 --num-rep 3 --test --show-mappings > /tmp/b
> 2>&1
>      wc -l /tmp/a                          # num total mappings
>      diff -u /tmp/a /tmp/b | grep -c ^+    # num changed mappings
>
>    Divide the number of changed mappings by the total number of lines in
>    /tmp/a to get the fraction of mappings that will move (for example,
>    5,000 changed out of 100,000 total would be 5%).  We've found that
>    most clusters are under 10%.
>
>    You can force all of this rebalancing to happen at once with::
>
>      ceph osd crush reweight-all
>
>    Otherwise, it will happen at some unknown point in the future when
>    CRUSH weights are next adjusted.
>
> Notable Changes
> ---------------
>
> * ceph-fuse: flock, fcntl lock support (Yan, Zheng, Greg Farnum)
> * crush: fix straw bucket weight calculation, add straw_calc_version
>   tunable (#10095 Sage Weil)
> * crush: fix tree bucket (Rongzu Zhu)
> * crush: fix underflow of tree weights (Loic Dachary, Sage Weil)
> * crushtool: add --reweight (Sage Weil)
> * librbd: complete pending operations before losing image (#10299 Jason
>   Dillaman)
> * librbd: fix read caching performance regression (#9854 Jason Dillaman)
> * librbd: gracefully handle deleted/renamed pools (#10270 Jason Dillaman)
> * mon: fix dump of chooseleaf_vary_r tunable (Sage Weil)
> * osd: fix PG ref leak in snaptrimmer on peering (#10421 Kefu Chai)
> * osd: handle no-op write with snapshot (#10262 Sage Weil)
> * radosgw-admi
>
>
>
>
> On 03/16/2015 12:37 PM, Alexandre DERUMIER wrote:
> >>> VMs are running on the same nodes as the OSDs
> > Are you sure that you didn't hit some kind of out-of-memory condition?
> > PG rebalancing can be memory hungry (it depends on how many OSDs you have).
>
> 2 OSDs per host, and 5 hosts in this cluster.
> hosts h
>
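
As an aside on the out-of-memory question above: a quick, generic way to rule the OOM killer in or out on a hypervisor node (nothing Ceph-specific, just the kernel log) is:

    dmesg -T | grep -i -E 'out of memory|oom-killer'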

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


