Hi! I just tried upgrading my test cluster from Mimic (13.2.5) to Nautilus (14.2.0), and everything looked fine until I activated msgr2. At that moment, one of my three MONs (the then-active one) fell out of the quorum and now refuses to rejoin. The two other MONs seem to work fine.
ceph-mon.log on that host is filling rapidly with messages (mainly from rocksdb), but I can't find any useful information that hints at a problem. On one of the remaining MONs I can see:
2019-03-25 11:36:04.081 7fb4add7a700 0 --1- [v2:172.17.0.35:3300/0,v1:172.17.0.35:6789/0] >> v1:172.17.0.37:6789/0 conn(0x559095a5bc00 0x559095a2d800 :6789 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2019-03-25 11:36:04.152 7fb4ad579700 0 --1- [v2:172.17.0.35:3300/0,v1:172.17.0.35:6789/0] >> v1:172.17.0.37:6789/0 conn(0x559093ce9400 0x559093cec800 :6789 s=OPENED pgs=4422675 cs=1 l=0).fault initiating reconnect
and similar messages, but again with little information for me. 172.17.0.37 is the broken MON, 172.17.0.35 is one of the remaining ones.
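To pick the relevant fields out of such lines in a busy log, here is a minimal Python sketch. The line layout is inferred only from the two messages quoted above, so treat the regular expression as an assumption rather than a documented log format; the names `LINE_RE` and `parse_conn_line` are made up for illustration. The `--1-` prefix appears to mark a connection using the v1 messenger protocol, which would match the peer being addressed as `v1:172.17.0.37:6789/0`, but that reading is also an assumption.

```python
import re

# Assumed layout, inferred from the quoted log lines only:
#   --<proto>- [<local addrs>] >> <peer addr> conn(... s=<STATE> ...).<message>
LINE_RE = re.compile(
    r"--(?P<proto>\d)- "            # messenger protocol marker, e.g. --1-
    r"\[(?P<local>[^\]]+)\] >> "    # local address vector in brackets
    r"(?P<peer>\S+) "               # peer address
    r"conn\(.*?\bs=(?P<state>\S+)"  # connection state (s=..., not pgs=/cs=)
)


def parse_conn_line(line):
    """Return proto, local addresses, peer and state, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "proto": int(m.group("proto")),
        "local": m.group("local").split(","),
        "peer": m.group("peer"),
        "state": m.group("state"),
    }


sample = (
    "2019-03-25 11:36:04.152 7fb4ad579700 0 --1- "
    "[v2:172.17.0.35:3300/0,v1:172.17.0.35:6789/0] >> v1:172.17.0.37:6789/0 "
    "conn(0x559093ce9400 0x559093cec800 :6789 s=OPENED pgs=4422675 cs=1 l=0)"
    ".fault initiating reconnect"
)
info = parse_conn_line(sample)
print(info)
```

Run over the whole ceph-mon.log, this would show at a glance whether the healthy MONs ever talk to 172.17.0.37 over anything other than its v1 address.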
ceph.conf on all MONs contains
mon host = 172.17.0.35,172.17.0.36,172.17.0.37
i.e. it should work for both v1 and v2.
Can anybody tell me how to debug this situation further? Or even solve it?
--
Jörn Clausen
Daten- und Rechenzentrum
GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Düsternbrookerweg 20
24105 Kiel