On Wed, 23 Sep 2015, Alexander Yang wrote:
> Hello,
> We use Ceph + OpenStack in our private cloud. Our cluster has 5 mons and
> 800 OSDs, with a capacity of about 1 PB, running about 700 VMs and 1100
> volumes.
> Recently we increased pg_num, so the cluster now has about 70000 PGs. My
> intention was for every OSD to hold about 100 PGs, but after increasing
> pg_num I found I was wrong: because the OSDs have different CRUSH weights,
> their PG counts differ, and some OSDs now hold more than 500 PGs.
> Now the problem appears: when I want to change an OSD's weight, which
> means changing the crushmap, the change causes only about 0.03% of the data
> to migrate, yet the mons keep starting elections. This hangs the cluster,
> and when the elections end, the original leader is still the leader. During
> the mon elections, the VMs on the upper layer see many slow requests. So
> now I don't dare to perform any operation that changes the crushmap. But I
> worry about something important: if our cluster loses a host, or even a
> rack, the crushmap will change a lot and the data migration will also be
> large. I worry the cluster will hang for a long time and, on the upper
> layer, all the VMs will end up shut down.
> My guess is that when I change the crushmap, either *the leader mon has
> too much information to calculate*, or *too many clients want to fetch the
> new crushmap from the leader mon*. That must hang the mon thread, so the
> leader mon can't heartbeat to the other mons, the other mons decide the
> leader is down, and a new election begins. I'm sorry if my guess is wrong.
> The crushmap is attached. Can anyone give me some advice or guidance?
> Thanks very much!

There were huge improvements made in hammer in terms of mon efficiency in 
these cases where it is under load.  I recommend upgrading as that will 
help.

You can also mitigate the problem somewhat by adjusting the mon_lease and 
associated settings up.
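Concretely, scaling those options up in the [mon] section of ceph.conf might look like the sketch below. The baseline values in the comment are the stock defaults as I recall them for this era, which is an assumption — verify the running values with `ceph daemon mon.<id> config show` before changing anything:

```ini
[mon]
# Assumed stock defaults: mon_lease = 5, mon_lease_renew_interval = 3,
# mon_lease_ack_timeout = 10, mon_accept_timeout = 10.  Tripled below:
mon_lease = 15
mon_lease_renew_interval = 9
mon_lease_ack_timeout = 30
mon_accept_timeout = 30
```

The same values can be injected into running mons with `ceph tell mon.* injectargs`, but putting them in ceph.conf makes them survive a restart.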
Scale all of mon_lease, mon_lease_renew_interval, mon_lease_ack_timeout, 
and mon_accept_timeout by 2x or 3x.

It also sounds like you may be using some older tunables/settings for your 
pools or crush rules.  Can you attach the output of 'ceph osd dump' and 
'ceph osd crush dump | tail -n 20'?

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com