Re: Mimic upgrade failure

Janne Johansson <icepic.dz@xxxxxxxxx> · Mon, 10 Sep 2018 09:15:10 +0200

On Mon, 10 Sep 2018 at 08:10, Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:

    Update for the list archive.

      I went ahead and finished the Mimic upgrade with the OSDs in a
      fluctuating state of up and down. The cluster did start to
      normalize a lot more easily once everything was on Mimic, since the
      random mass OSD heartbeat failures stopped and the constant mon
      election problem went away. I'm still battling with the cluster
      reacting poorly to host reboots or small map changes, but I feel
      like my current pg:osd ratio may be a factor in that, since
      we are at 2x the normal PG count while migrating data to new EC pools.

We found a setting that helped us when we had constant re-elections. Ours were far more frequent and had nothing to do with Mimic, but increasing the time between elections at least allowed our cluster to start. The cycle was: the mons voted and decided on a master, the master started (re)playing transactions and got so busy that the others called for a new election, the same mon won again, restarted the job, and the whole thing repeated. Raising the election interval to 30 s instead of the default (5?) gave the mon time to finish working through its backlog and start replying to heartbeats as expected, and things went more smoothly from there.

mon_lease = 30 for future reference.
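
A minimal sketch of how the setting above might look in ceph.conf on the monitor hosts. Placing it in the [mon] section is an assumption, since the post does not say where it was set. The value is in seconds, and a mon reads ceph.conf at startup, so it would need a restart to pick up the change.

```ini
# ceph.conf on each monitor host (section placement is an assumption)
[mon]
# Lease length in seconds; raised from the default to give a busy
# mon time to finish its work before the others call a new election.
mon_lease = 30
```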

-- 
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com