Re: Understanding monitor requirements

Brian Topping <brian.topping@xxxxxxxxx> · Sat, 11 Apr 2020 16:32:31 -0600

Hi again. After all that, this appears to be an MTU issue.

Baseline: 

1) Two of the nodes are on plain Ethernet with a 1500-byte MTU; the third (problem) node is on a WAN tunnel with a restricted MTU. It appears that the MTUs were not set up correctly, so it is no surprise that some software has problems.
2) I decided I knew Ceph well enough to handle recovery from disaster cases under Rook, and Rook has advantages I can use. So please keep that in mind as I discuss this issue. (For those who aren’t familiar, Rook just orchestrates containers that are built by the Ceph team.)
3) In Rook, monitors run as pods under a CNI. The CNI adds encapsulation overhead in transit, in my case from a VXLAN overlay network. This overhead is apparently not enough to cause problems between nodes on a full 1500-byte MTU local network, so the first two monitors come up cleanly.
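
For reference, the overhead in point 3 can be put into numbers. This is only a rough sketch: it assumes VXLAN over an IPv4 underlay with no VLAN tag and no IP options, which gives the commonly quoted 50 bytes of encapsulation. The names `overlay_mtu` and `fits` are made up for illustration, and any overhead added by the WAN tunnel itself is not modeled.

```python
# Assumed header sizes for VXLAN over IPv4 (no VLAN tag, no IP options).
OUTER_IPV4 = 20      # outer IPv4 header
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header
INNER_ETHERNET = 14  # inner Ethernet frame header carried inside the tunnel

VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET  # 50


def overlay_mtu(underlay_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """Largest inner IP packet that fits in a single underlay packet."""
    if underlay_mtu <= overhead:
        raise ValueError("underlay MTU is smaller than the encapsulation overhead")
    return underlay_mtu - overhead


def fits(inner_packet: int, underlay_mtu: int, overhead: int = VXLAN_OVERHEAD) -> bool:
    """True if an inner packet of this size survives encapsulation unfragmented."""
    return inner_packet + overhead <= underlay_mtu


if __name__ == "__main__":
    # On a full 1500-byte local network the overlay can carry 1450-byte packets.
    print(overlay_mtu(1500))  # prints 1450
```

The same subtraction applies to any smaller underlay: whatever MTU the restricted link really offers, pod traffic has to be at least 50 bytes smaller than that under these assumptions.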

After spending a lot of time in the logs, I could see that the mon map was properly distributed to all three nodes. When it came to an election, though, all nodes knew the election epoch but the third was not joining. Comparing the logs of the second node (as peon) with those of the troubled third node on the other side of the restricted MTU, the difference appeared to be that the third node was not providing a feature proposal. In fact it probably was sending one, and the message was being dropped. So the election would end without the third node in the quorum, the third node would stop asking for a new election, and that’s how things ended.

This morning I worked out the MTU of the WAN tunnel and set the MTU of the entire CNI to that value. I expected everything to start working, with any necessary fragmentation done by the client end of each connection.

Instead, the second node, which had previously been able to join as peon, could no longer do so. It seems to follow that the smaller MTU (1340 to be exact) set across the whole CNI causes the elections to fail.

There are a number of things I can do to improve the behavior of the cluster (such as path MTU discovery, PMTUD), but if Ceph is not going to work with a small MTU, all bets are off.
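
As a sanity check on what actually passes between two hosts or pods, independent of what the interfaces are configured to, the path can be probed by hand. The sketch below binary-searches for the largest IPv4 packet that gets through with the don't-fragment bit set. It assumes Linux iputils `ping` (`-M do` to prohibit fragmentation, `-s` for the ICMP payload size); `ping_probe` and `find_path_mtu` are illustrative names, not part of Ceph or Rook, and the 576 to 1500 search range is an arbitrary choice.

```python
import subprocess
from typing import Callable

IPV4_ICMP_HEADERS = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one DF-flagged ping of a given total packet size."""
    def probe(packet_size: int) -> bool:
        payload = packet_size - IPV4_ICMP_HEADERS
        result = subprocess.run(
            # Assumes iputils ping: one packet, 2 s timeout, fragmentation prohibited.
            ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe


def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Return the largest packet size in [low, high] that the probe accepts."""
    if not probe(low):
        raise RuntimeError(f"even {low}-byte packets do not get through")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Calling `find_path_mtu(ping_probe(host))` from inside a pod toward a pod on the far side of the tunnel measures the overlay path; running it between the nodes themselves measures the underlay. A lost ping is treated as "too big", so a lossy link can make the result come out low.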

I tried looking for existing issues on tracker.ceph.com, but apparently I hadn’t logged in there for a while and my account was deleted. I have applied for a new one.

Any ideas what I can do here?

Thanks! Brian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx