On Mon, 10 Sep 2018 at 08:10, Kevin Hrpcek <kevin.hrpcek@xxxxxxxxxxxxx> wrote:
Update for the list archive.
I went ahead and finished the Mimic upgrade with the OSDs in a fluctuating state of up and down. The cluster did start to normalize much more easily once everything was on Mimic, since the random mass OSD heartbeat failures stopped and the constant mon election problem went away. I'm still battling with the cluster reacting poorly to host reboots or small map changes, but I suspect my current pg:osd ratio is a factor in that, since we are at 2x the normal PG count while migrating data to new EC pools.
We found a setting that helped when we had constant re-elections. Ours were much more frequent and not related to Mimic at all, but increasing the time between elections allowed our cluster to at least start. The mons voted and decided on a master, the master started (re)playing transactions and got so busy that the others called for a new election, the same mon won again and restarted the job, and the cycle repeated. Raising the election time to 30s instead of the default (5?) allowed the mon to finish working through its backlog and start replying to heartbeats as expected, and it went more smoothly from there.
mon_lease = 30 for future reference.
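For anyone finding this in the archive, a minimal sketch of how that could look in ceph.conf. Placing it under the [mon] section and restarting the monitors to pick it up are my assumptions here, not something verified for every release, so check the documentation for your version first:

```ini
# Sketch only: section placement is an assumption, adjust to your deployment.
[mon]
# Monitor lease length in seconds; 30 is the value that worked for us.
mon_lease = 30
```

Once the cluster has settled, it is probably worth reconsidering whether the value should go back to the default.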
May the most significant bit of your life be positive.
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com