peer probe failures

Kenneth Talley <k.talley@xxxxxxxxx> · Mon, 03 Apr 2017 22:35:17 +0000

Hey all,
I've got a strange problem going on here. I've installed glusterfs-server on ubuntu 16.04:
glusterfs-client/xenial,now 3.7.6-1ubuntu1 amd64 [installed,automatic]
glusterfs-common/xenial,now 3.7.6-1ubuntu1 amd64 [installed,automatic]
glusterfs-server/xenial,now 3.7.6-1ubuntu1 amd64 [installed]

I can successfully probe another peer at this point. Then, after installing kubernetes via kargo, peer probing begins failing with a timeout. I've tried stopping all kubernetes related services, and flushing all iptables rules, however I don't see any packets leaving any interface when attempting to peer probe.

from cli.log:
[2017-04-03 22:20:24.704900] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-04-03 22:20:24.704973] T [cli.c:273:cli_rpc_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2017-04-03 22:20:24.705001] T [cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2017-04-03 22:20:24.705014] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
[2017-04-03 22:20:24.705204] T [rpc-clnt.c:1404:rpc_clnt_record] 0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2017-04-03 22:20:24.705256] T [rpc-clnt.c:1261:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 156, payload: 92, rpc hdr: 64
[2017-04-03 22:20:24.705662] T [socket.c:2879:socket_connect] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu
/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x84)[0x7f012f69add4] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x55)[0x7f012f697af5] (--> /usr/lib/x8
6_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f012f698338] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f012f6945b3] ))))) 0-glusterfs: connect
() called on transport already connected
[2017-04-03 22:20:24.705680] D [rpc-clnt-ping.c:98:rpc_clnt_remove_ping_timer_locked] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /
usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x84)[0x7f012f69add4] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x55)[0x7f012f
697af5] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f012f698338] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f012f6945b3] )))
)) 0-: /var/run/gluster/quotad.socket: ping timer event already removed
[2017-04-03 22:20:24.705710] T [cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_DISCONNECT
[2017-04-03 22:20:24.705718] T [rpc-clnt.c:1598:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 1) to rpc-transport (glusterfs)
[2017-04-03 22:20:24.705739] D [rpc-clnt-ping.c:281:rpc_clnt_start_ping] 0-glusterfs: ping timeout is 0, returning
[2017-04-03 22:20:24.705723] D [MSGID: 0] [event-epoll.c:591:event_dispatch_epoll_handler] 0-epoll: generation bumped on idx=1 from gen=1 to slot->gen=2, fd=7, slot->fd=7
[2017-04-03 22:20:27.614881] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2017-04-03 22:20:27.615151] T [socket.c:2879:socket_connect] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/rpc-transport/socket.so(+0x6c1b)[0x7f012a697c1b] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9)[0x7f012f695999] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfc)[0x7f012fd3d70c] (--> /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f012f0b86ba] ))))) 0-glusterfs: connect () called on transport already connected

it then repeats the following:
[2017-04-03 22:20:27.615177] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2017-04-03 22:20:27.615188] T [socket.c:2887:socket_connect] 0-glusterfs: connecting 0x25d3550, state=0 gen=0 sock=-1
[2017-04-03 22:20:27.615200] T [name.c:295:af_unix_client_get_remote_sockaddr] 0-glusterfs: using connect-path /var/run/gluster/quotad.socket
[2017-04-03 22:20:27.615218] T [name.c:111:af_unix_client_bind] 0-glusterfs: bind-path not specified for unix socket, letting connect to assign default value
[2017-04-03 22:20:27.615329] T [cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2017-04-03 22:20:27.615355] I [socket.c:2355:socket_event_handler] 0-transport: disconnecting now
[2017-04-03 22:20:27.615567] D [rpc-clnt-ping.c:98:rpc_clnt_remove_ping_timer_locked] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_remove_ping_timer_locked+0x84)[0x7f012f69add4] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x55)[0x7f012f697af5] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f012f698338] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f012f6945b3] ))))) 0-: /var/run/gluster/quotad.socket: ping timer event already removed
[2017-04-03 22:20:27.615598] T [cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_DISCONNECT
[2017-04-03 22:20:27.615612] D [MSGID: 0] [event-epoll.c:591:event_dispatch_epoll_handler] 0-epoll: generation bumped on idx=1 from gen=4 to slot->gen=5, fd=7, slot->fd=7
[2017-04-03 22:20:30.615864] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
[2017-04-03 22:20:30.616113] T [socket.c:2879:socket_connect] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f012fd21953] (--> /usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/rpc-transport/socket.so(+0x6c1b)[0x7f012a697c1b] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9)[0x7f012f695999] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfc)[0x7f012fd3d70c] (--> /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f012f0b86ba] ))))) 0-glusterfs: connect () called on transport already connected

and from etc-glusterfs-glusterd.vol.log i see the probe request and then:
[2017-04-03 22:20:24.707258] D [MSGID: 0] [glusterd-peer-utils.c:133:glusterd_peerinfo_find_by_hostname] 0-management: Unable to find friend: 10.1.0.7
[2017-04-03 22:20:24.707260] D [MSGID: 0] [glusterd-peer-utils.c:247:glusterd_peerinfo_find] 0-management: Unable to find hostname: 10.1.0.7
[2017-04-03 22:20:24.707293] I [MSGID: 106129] [glusterd-handler.c:3600:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: 10.1.0.7 (24007)
[2017-04-03 22:20:24.707314] D [MSGID: 0] [glusterd-peer-utils.c:458:glusterd_peer_hostname_new] 0-glusterd: Returning 0
[2017-04-03 22:20:24.707329] D [MSGID: 0] [glusterd-utils.c:5974:glusterd_sm_tr_log_init] 0-glusterd: returning 0
[2017-04-03 22:20:24.707353] D [MSGID: 0] [glusterd-store.c:3802:glusterd_store_create_peer_dir] 0-glusterd: Returning with 0
[2017-04-03 22:20:24.728550] D [MSGID: 0] [store.c:421:gf_store_handle_new] 0-: Returning 0
[2017-04-03 22:20:24.728679] D [MSGID: 0] [store.c:375:gf_store_save_value] 0-management: returning: 0
[2017-04-03 22:20:24.728720] D [logging.c:1952:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
The message "D [MSGID: 0] [store.c:375:gf_store_save_value] 0-management: returning: 0" repeated 2 times between [2017-04-03 22:20:24.728679] and [2017-04-03 22:20:24.728718]
[2017-04-03 22:20:24.728720] D [MSGID: 0] [glusterd-store.c:3931:glusterd_store_peer_write] 0-glusterd: Returning with 0
[2017-04-03 22:20:24.786973] D [MSGID: 0] [glusterd-store.c:3957:glusterd_store_perform_peer_store] 0-glusterd: Returning 0
[2017-04-03 22:20:24.787007] D [MSGID: 0] [glusterd-store.c:3978:glusterd_store_peerinfo] 0-glusterd: Returning with 0
[2017-04-03 22:20:24.787046] D [MSGID: 0] [glusterd-handler.c:3391:glusterd_transport_inet_options_build] 0-glusterd: Returning 0
[2017-04-03 22:20:24.787077] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-04-03 22:20:24.787090] D [rpc-clnt.c:996:rpc_clnt_connection_init] 0-management: setting ping-timeout to 30
[2017-04-03 22:20:24.787106] D [rpc-transport.c:288:rpc_transport_load] 0-rpc-transport: attempt to load file /usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/rpc-transport/socket.so
[2017-04-03 22:20:24.787147] T [MSGID: 0] [options.c:91:xlator_option_validate_int] 0-management: no range check required for 'option transport.tcp-user-timeout -1'
[2017-04-03 22:20:24.787166] T [MSGID: 0] [options.c:91:xlator_option_validate_int] 0-management: no range check required for 'option transport.socket.keepalive-time 10'
[2017-04-03 22:20:24.787182] T [MSGID: 0] [options.c:91:xlator_option_validate_int] 0-management: no range check required for 'option transport.socket.keepalive-interval 2'
[2017-04-03 22:20:24.787203] T [MSGID: 0] [options.c:91:xlator_option_validate_int] 0-management: no range check required for 'option remote-port 24007'
[2017-04-03 22:20:24.787221] D [socket.c:3845:socket_init] 0-management: Configued transport.tcp-user-timeout=-1
[2017-04-03 22:20:24.787233] D [socket.c:3928:socket_init] 0-management: SSL support on the I/O path is NOT enabled
[2017-04-03 22:20:24.787242] D [socket.c:3931:socket_init] 0-management: SSL support for glusterd is NOT enabled
[2017-04-03 22:20:24.787251] D [socket.c:3948:socket_init] 0-management: using system polling thread
[2017-04-03 22:20:24.787263] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-management: attempting reconnect
[2017-04-03 22:20:24.787273] T [socket.c:2887:socket_connect] 0-management: connecting 0x7ff7fc005640, state=0 gen=0 sock=-1
[2017-04-03 22:20:24.787288] T [MSGID: 0] [common-utils.c:290:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: 10.1.0.7
[2017-04-03 22:20:24.787843] D [MSGID: 0] [common-utils.c:333:gf_resolve_ip6] 0-resolver: returning ip-10.1.0.7 (port-24007) for hostname: 10.1.0.7 and port: 24007
[2017-04-03 22:20:24.787863] D [socket.c:2833:socket_fix_ssl_opts] 0-management: disabling SSL for portmapper connection
[2017-04-03 22:20:24.787883] T [socket.c:805:__socket_nodelay] 0-management: NODELAY enabled for socket 8
[2017-04-03 22:20:24.787899] W [socket.c:869:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 8, Invalid argument
[2017-04-03 22:20:24.787910] E [socket.c:2965:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
[2017-04-03 22:20:24.788092] D [MSGID: 0] [common-utils.c:2887:gf_ports_reserved] 0-glusterfs: lower: 30000, higher: 32767
[2017-04-03 22:21:52.911012] D [logging.c:1828:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2017-04-03 22:21:52.911131] D [logging.c:1790:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2017-04-03 22:22:24.707002] D [socket.c:558:__socket_rwv] 0-socket.management: EOF on socket
[2017-04-03 22:22:24.707032] D [socket.c:2355:socket_event_handler] 0-transport: disconnecting now

I've debugged this to the best of my ability but haven't made any progress narrowing it down, nor has google turned up anything useful.

Thanks for the help!

Ken

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users