Hi all,
I've identified a race in the RPC layer that causes some spurious disconnections and CHILD_DOWN notifications.
The problem happens when protocol/client reconfigures a connection to move from glusterd to glusterfsd. This is done by calling rpc_clnt_reconfig() followed by rpc_transport_disconnect().
This seems fine because client_rpc_notify() will call rpc_clnt_cleanup_and_start() when the disconnect notification is received. However, there's a problem.
Suppose that the disconnection notification has been processed and we are just about to call rpc_clnt_cleanup_and_start(). If the reconnection timer fires at this point, rpc_clnt_reconnect() will run. This reconnects the socket, a connection notification is processed, and a handshake request is sent to the server.
However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs are deleted. When we receive the answer to the handshake, we are unable to map its XID, so the request fails. The handshake therefore fails and the client is considered down, which sends a CHILD_DOWN notification to the upper xlators.
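To make the ordering concrete, here is a small standalone model of the sequence described above. It is only a sketch and not the actual GlusterFS code: struct pending_table, table_add(), table_take(), table_clear() and simulate_handshake() are hypothetical names, and the two code paths are run one after the other in a single thread instead of racing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical stand-in for the per-connection list of in-flight XIDs. */
struct pending_table {
    uint32_t xids[MAX_PENDING];
    size_t count;
};

/* Record a request that has been sent and is waiting for its reply. */
void table_add(struct pending_table *t, uint32_t xid)
{
    assert(t->count < MAX_PENDING);
    t->xids[t->count++] = xid;
}

/* Match a reply to a pending request; returns false if the XID is unknown. */
bool table_take(struct pending_table *t, uint32_t xid)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->xids[i] == xid) {
            t->xids[i] = t->xids[--t->count];
            return true;
        }
    }
    return false;
}

/* Models the cleanup step dropping every sent XID. */
void table_clear(struct pending_table *t)
{
    t->count = 0;
}

/* Returns true if the handshake reply can still be mapped to its request. */
bool simulate_handshake(bool reconnect_fires_before_cleanup)
{
    struct pending_table t = { .count = 0 };
    const uint32_t handshake_xid = 1;

    if (reconnect_fires_before_cleanup) {
        /* Racy order: the timer reconnects and sends the handshake... */
        table_add(&t, handshake_xid);
        /* ...and then the delayed cleanup wipes the XID it just sent. */
        table_clear(&t);
    } else {
        /* Expected order: cleanup first, then reconnect and handshake. */
        table_clear(&t);
        table_add(&t, handshake_xid);
    }

    /* The server's reply arrives and we try to map its XID. */
    return table_take(&t, handshake_xid);
}
```

In this model simulate_handshake(false) finds the request, while simulate_handshake(true) cannot map the reply, which corresponds to the failed handshake and the resulting CHILD_DOWN.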
As a result, some tests start processing operations while a brick is unexpectedly down, which causes spurious test failures.
To solve the problem I've made the rpc_clnt_reconfig() function disable the RPC connection, using code similar to rpc_clnt_disable(). This prevents the background rpc_clnt_reconnect() timer from being executed, avoiding the problem.
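The idea behind the change can be sketched with another small standalone model. Again, this is not the real patch: struct conn_model and the model_*() functions are hypothetical, and the assumptions that the timer callback checks a "disabled" flag and that the cleanup step re-enables the connection are made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified connection state (not the real rpc_clnt types). */
struct conn_model {
    bool disabled;          /* set while the connection is being reconfigured */
    bool connected;
    int reconnect_attempts;
};

/* Models the reconnect timer callback: it does nothing while disabled. */
bool model_reconnect_timer(struct conn_model *c)
{
    assert(c != NULL);
    if (c->disabled)
        return false;
    c->connected = true;
    c->reconnect_attempts++;
    return true;
}

/* Models reconfiguration followed by the transport disconnect.
 * With disable_on_reconfig set, the connection is disabled first. */
void model_reconfig(struct conn_model *c, bool disable_on_reconfig)
{
    assert(c != NULL);
    if (disable_on_reconfig)
        c->disabled = true;
    c->connected = false;
}

/* Models the cleanup step: assumed to re-enable the connection. */
void model_cleanup_and_start(struct conn_model *c)
{
    assert(c != NULL);
    c->disabled = false;
}

/* Counts reconnect attempts made between reconfig and cleanup. */
int attempts_during_reconfig(bool disable_on_reconfig)
{
    struct conn_model c = { .disabled = false, .connected = true,
                            .reconnect_attempts = 0 };

    model_reconfig(&c, disable_on_reconfig);
    model_reconnect_timer(&c);      /* the timer fires inside the window */
    int attempts = c.reconnect_attempts;
    model_cleanup_and_start(&c);
    return attempts;
}
```

With the flag set during reconfiguration, attempts_during_reconfig(true) sees no reconnection inside the window; without it, attempts_during_reconfig(false) sees one, which is the reconnection that triggers the race.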
This seems to work fine for many tests, but it appears to cause some issue in gfapi-based tests. I'm still investigating this.
Xavi