Hello,

I've created an Octopus 15.2.4 cluster with 3 monitors and 3 OSDs (6 hosts in total, all ESXi VMs). It survived a couple of reboots without problems. Then I reconfigured the main host a bit: selected iptables-legacy as the current option in update-alternatives (this is a Debian 10 system), applied a basic iptables ruleset and restarted docker.

After that the cluster became unresponsive: any ceph command hangs indefinitely. I can still use the admin socket to manipulate the config, though. With debug_ms set to 5 I see this in the monitor log (timestamps cut for readability):

7f4096f41700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> [v2:<mon2_ip>:3300/0,v1:<mon2_ip>:6789/0] conn(0x55c21b975800 0x55c21ab45180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0).send_message enqueueing message m=0x55c21bd84a00 type=67 mon_probe(probe e30397f0-cc32-11ea-8c8e-000c29469cd5 name mon1 mon_release octopus) v7
7f4098744700 1 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v2:81.200.2.152:6800/561959008
7f4098744700 2 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process connection refused!

and this:

7f4098744700 2 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
7f4098744700 1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_recv_state
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_security
7f409373a700 1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
7f4098744700 1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=8 peer_addr_for_me=v2:<mon1_ip>:3300/0
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello getsockname says I am <mon1_ip>:3300 when talking to v2:<mon1_ip>:49012/0
7f4098744700 1 mon.mon1@0(probing) e5 handle_auth_request failed to assign global_id

Config (the result of ceph --admin-daemon /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config show): https://pastebin.com/kifMXs9H

I can connect to ports 3300 and 6789 with telnet; connections to 6800 and 6801 are refused ('connection refused'). Setting all iptables policies to ACCEPT didn't change anything.

Where should I start digging to fix this problem? I'd like to at least understand why this happened before putting the cluster into production. Any help is appreciated.
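For reference, a rough reconstruction of the commands involved (the actual ruleset I applied is omitted, and the exact invocations are from memory, so treat them as approximate):

    # switch the iptables backend to legacy (Debian 10 defaults to the nftables backend)
    update-alternatives --set iptables /usr/sbin/iptables-legacy
    # ... basic INPUT ruleset applied here (not shown) ...
    systemctl restart docker

    # the ceph CLI hangs, so raise messenger debugging via the monitor admin socket instead
    ceph --admin-daemon /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config set debug_ms 5

    # connectivity checks against mon1
    telnet <mon1_ip> 3300    # connects
    telnet <mon1_ip> 6789    # connects
    telnet <mon1_ip> 6800    # connection refused
    telnet <mon1_ip> 6801    # connection refused

    # opening the firewall completely did not help
    iptables -P INPUT ACCEPT
    iptables -P FORWARD ACCEPT
    iptables -P OUTPUT ACCEPT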