Unable to peer probe after upgrade to 3.3

d.a.bretherton at reading.ac.uk (Dan Bretherton) · Thu, 02 Aug 2012 16:21:47 +0100

Dear All-
My recent upgrade from 3.2.6 to 3.3.0 went well, but now I can't add new 
peers to the cluster.  I can create a new peer group of servers all with 
3.3 freshly installed, but if any one of them was upgraded from 3.2 the 
"gluster peer probe" commands just hang for a while and return nothing. 
Following that, "gluster peer status" results in output like the 
following for the new peer being added.

Hostname: compute-0-4.nerc-essc.ac.uk
Uuid: 111612e4-537b-49b4-9e88-2e0e1bae7fdf
State: Establishing Connection (Connected)

Errors like these are produced in etc-glusterfs-glusterd.vol.log.

[2012-08-02 13:00:53.553927] I [glusterd-op-sm.c:2653:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-08-02 15:55:19.244849] I [glusterd-handler.c:679:glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req compute-0-4.nerc-essc.ac.uk 24007
[2012-08-02 15:55:19.357191] I [glusterd-handler.c:423:glusterd_friend_find] 0-glusterd: Unable to find hostname: compute-0-4.nerc-essc.ac.uk
[2012-08-02 15:55:19.357261] I [glusterd-handler.c:2222:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: compute-0-4.nerc-essc.ac.uk (24007)
[2012-08-02 15:55:19.385050] I [glusterd-handler.c:2204:glusterd_friend_add] 0-management: connect returned 0
[2012-08-02 15:55:19.387162] E [socket.c:1715:socket_connect_finish] 0-management: connection to  failed (Connection refused)
[2012-08-02 15:55:19.387239] I [glusterd-handler.c:2400:glusterd_xfer_cli_probe_resp] 0-glusterd: Responded to CLI, ret: 0
[2012-08-02 15:55:19.387274] I [mem-pool.c:576:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2012-08-02 15:55:19.387294] I [mem-pool.c:576:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2012-08-02 15:55:33.026866] I [glusterd-handler.c:813:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2012-08-02 15:55:49.766295] I [glusterd-handler.c:679:glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req compute-0-4.nerc-essc.ac.uk 24007
[2012-08-02 15:55:49.841049] I [glusterd-handler.c:423:glusterd_friend_find] 0-glusterd: Unable to find hostname: compute-0-4.nerc-essc.ac.uk
[2012-08-02 15:55:49.841101] I [glusterd-handler.c:2222:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: compute-0-4.nerc-essc.ac.uk (24007)
[2012-08-02 15:55:49.857231] I [glusterd-handler.c:2204:glusterd_friend_add] 0-management: connect returned 0
[2012-08-02 15:55:49.857804] I [glusterd-handshake.c:397:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd mgmt, Num (1238433), Version (2)
[2012-08-02 15:55:49.857840] I [glusterd-handshake.c:403:glusterd_set_clnt_mgmt_program] 0-: Using Program Peer mgmt, Num (1238437), Version (2)
[2012-08-02 15:55:49.868300] I [glusterd-rpc-ops.c:218:glusterd3_1_probe_cbk] 0-glusterd: Received probe resp from uuid: 111612e4-537b-49b4-9e88-2e0e1bae7fdf, host: compute-0-4.nerc-essc.ac.uk
[2012-08-02 15:55:49.868344] I [glusterd-handler.c:411:glusterd_friend_find] 0-glusterd: Unable to find peer by uuid
[2012-08-02 15:55:49.868406] E [glusterd-sm.c:1022:glusterd_friend_sm] 0-glusterd: handler returned: -1
[2012-08-02 15:55:49.868425] I [glusterd-rpc-ops.c:286:glusterd3_1_probe_cbk] 0-glusterd: Received resp to probe req
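
Most of the entries above are informational (level "I"); only two are errors (level "E"). As a rough aid for picking those out of a longer log, here is a minimal Python sketch. It assumes the entry layout seen in this excerpt (timestamp, one-letter level, source location, message) and that a mail client may have wrapped entries across lines. The function names and the SAMPLE text are illustrative only and are not part of GlusterFS.

```python
import re

# An entry starts with a bracketed timestamp such as [2012-08-02 15:55:19.387162]
ENTRY_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\] ")
# Layout assumed from the excerpt: [time] LEVEL [file:line:function] message
ENTRY = re.compile(
    r"^\[(?P<time>[^\]]+)\] (?P<level>[A-Z]) \[(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def join_wrapped(lines):
    """Merge continuation lines into the entry they belong to."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def errors_only(lines):
    """Return (time, source, message) for every entry logged at level E."""
    found = []
    for entry in join_wrapped(lines):
        match = ENTRY.match(entry)
        if match and match.group("level") == "E":
            found.append(
                (match.group("time"), match.group("source"), match.group("message"))
            )
    return found


# Two entries copied from the log above, still wrapped as in the email.
SAMPLE = """\
[2012-08-02 15:55:19.385050] I
[glusterd-handler.c:2204:glusterd_friend_add] 0-management: connect
returned 0
[2012-08-02 15:55:19.387162] E [socket.c:1715:socket_connect_finish]
0-management: connection to  failed (Connection refused)
"""

if __name__ == "__main__":
    for time, source, message in errors_only(SAMPLE.splitlines()):
        print(time, source, message)
```

Run against the full log file, this would leave just the "Connection refused" and "handler returned: -1" entries to look at.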

In /etc/glusterd/peers a file named after the machine being added is 
created, as in this example.

[root@bdan10 peers]# cat compute-0-4.nerc-essc.ac.uk
uuid=00000000-0000-0000-0000-000000000000
state=0
hostname1=compute-0-4.nerc-essc.ac.uk

However, the machine in question does have a valid UUID, as shown below.

[root@compute-0-4 etc]# cat /var/lib/glusterd/glusterd.info
UUID=111612e4-537b-49b4-9e88-2e0e1bae7fdf
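
The mismatch between the two files can be checked mechanically. The following is a minimal Python sketch, assuming only the simple key=value layout visible in the two listings above. The helper names are hypothetical, and the file contents are pasted in as strings rather than read from the servers.

```python
import uuid


def parse_key_value(text):
    """Parse key=value lines into a dict with lower-cased keys."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip().lower()] = value.strip()
    return fields


def uuid_is_unset(fields):
    """True if the uuid field is missing, malformed, or all zeros."""
    raw = fields.get("uuid")
    if raw is None:
        return True
    try:
        return uuid.UUID(raw).int == 0
    except ValueError:
        return True


# Contents copied from the two listings in this post.
PEER_FILE = """\
uuid=00000000-0000-0000-0000-000000000000
state=0
hostname1=compute-0-4.nerc-essc.ac.uk
"""
GLUSTERD_INFO = "UUID=111612e4-537b-49b4-9e88-2e0e1bae7fdf\n"

if __name__ == "__main__":
    peer = parse_key_value(PEER_FILE)
    info = parse_key_value(GLUSTERD_INFO)
    print("peer record has no usable uuid:", uuid_is_unset(peer))
    print("new machine has no usable uuid:", uuid_is_unset(info))
    print("uuids match:", peer.get("uuid") == info.get("uuid"))
```

Keys are lower-cased so that the same parser handles both "uuid=" in the peer file and "UUID=" in glusterd.info.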

The machine being added, compute-0-4, had GlusterFS 3.3 freshly installed 
and was not upgraded from 3.2.  On that machine the command "gluster peer 
status" outputs the following.

Number of Peers: 1

Hostname: 192.171.166.92
Uuid: 00000000-0000-0000-0000-000000000000
State: Establishing Connection (Connected)

The IP address shown refers to the server where "gluster peer probe" was 
executed.
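
With more than a handful of peers, spotting entries in this state by eye gets tedious. Below is a minimal Python sketch that parses text in the "gluster peer status" layout shown in this post and lists peers that are still in "Establishing Connection" or that report an all-zero Uuid. It works on output captured as a string; the function names are hypothetical and the field names are taken only from the output quoted here.

```python
ZERO_UUID = "00000000-0000-0000-0000-000000000000"


def parse_peer_status(text):
    """Split peer status text into one dict per peer, keyed by field name."""
    peers = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Hostname":
            # Each peer record begins with its Hostname line.
            current = {"Hostname": value}
            peers.append(current)
        elif current is not None:
            current[key] = value
    return peers


def suspect_peers(peers):
    """Peers with an all-zero Uuid or still establishing a connection."""
    return [
        peer
        for peer in peers
        if peer.get("Uuid") == ZERO_UUID
        or peer.get("State", "").startswith("Establishing Connection")
    ]


# Output copied from the freshly installed machine, as quoted above.
STATUS_TEXT = """\
Number of Peers: 1

Hostname: 192.171.166.92
Uuid: 00000000-0000-0000-0000-000000000000
State: Establishing Connection (Connected)
"""

if __name__ == "__main__":
    for peer in suspect_peers(parse_peer_status(STATUS_TEXT)):
        print(peer["Hostname"], peer.get("Uuid"), peer.get("State"))
```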

I tried restarting glusterd on all the servers but it didn't make any 
difference, and doing the "peer probe" from a different server in the 
cluster had the same result.  Has anyone else experienced this problem 
and is there a solution or work-around?  All suggestions would be much 
appreciated.

Regards,
Dan.