Re: Race in protocol/client and RPC

Shyam Ranganathan <srangana@xxxxxxxxxx> · Thu, 1 Feb 2018 08:48:38 -0500

On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> After having tried several things, it seems that it will be complex to
> solve these races. All attempts to fix them have caused failures in
> other connections. Since I have other work to do and this doesn't seem
> to be causing serious failures in production, I'll leave it for now and
> pick it up again when I have more time.

Xavi, could you file a bug with these findings and post the details
there (if that is not already done), so that it can be followed up?

> 
> Xavi
> 
> On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan@xxxxxxxxxx
> <mailto:jahernan@xxxxxxxxxx>> wrote:
> 
>     Hi all,
> 
>     I've identified a race in the RPC layer that causes some spurious
>     disconnections and CHILD_DOWN notifications.
> 
>     The problem happens when protocol/client reconfigures a connection
>     to move from glusterd to glusterfsd. This is done by calling
>     rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> 
>     This seems fine because client_rpc_notify() will call
>     rpc_clnt_cleanup_and_start() when the disconnect notification is
>     received. However, there's a problem.
> 
>     Suppose that the disconnection notification has been processed and
>     we are just about to call rpc_clnt_cleanup_and_start(). If the
>     reconnection timer fires at this point, rpc_clnt_reconnect() will
>     run. This will cause the socket to be reconnected and a connection
>     notification to be processed. Then a handshake request will be sent
>     to the server.
> 
>     However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs
>     are deleted. When we receive the answer to the handshake, we are
>     unable to map its XID, so the request fails. The handshake therefore
>     fails and the client is considered down, which sends a CHILD_DOWN
>     notification to the upper xlators.
> 
>     In some tests, this causes operations to start while a brick is
>     unexpectedly down, which leads to spurious test failures.
> 
>     To solve the problem I've forced the rpc_clnt_reconfig() function to
>     disable the RPC connection using code similar to rpc_clnt_disable().
>     This prevents the background rpc_clnt_reconnect() timer from being
>     executed, avoiding the problem.
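
For anyone following up, the interleaving described above can be replayed
in a small single-threaded toy model. This is only a sketch to make the
ordering concrete. It is not GlusterFS code: every type and function here
(toy_conn, toy_send, toy_reconnect_timer, toy_cleanup_and_start,
toy_reply, toy_handshake_survives) is hypothetical, and the assumption
that the cleanup step re-enables reconnection is mine, not something
stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these types or helpers are GlusterFS code. */
#define MAX_SAVED 8

struct toy_conn {
    bool     connected;
    bool     reconnect_disabled;  /* hypothetical flag set by the fix */
    uint32_t next_xid;
    uint32_t saved[MAX_SAVED];    /* XIDs of requests awaiting a reply */
    size_t   n_saved;
};

/* Send a request and remember its XID so the reply can be matched. */
static uint32_t toy_send(struct toy_conn *c)
{
    assert(c->n_saved < MAX_SAVED);
    uint32_t xid = c->next_xid++;
    c->saved[c->n_saved++] = xid;
    return xid;
}

/* Stands in for the reconnect timer callback. Returns true, and stores
 * the handshake XID, only if it actually reconnected. */
static bool toy_reconnect_timer(struct toy_conn *c, uint32_t *handshake_xid)
{
    if (c->reconnect_disabled || c->connected)
        return false;
    c->connected = true;
    *handshake_xid = toy_send(c);
    return true;
}

/* Stands in for the cleanup step: every saved XID is dropped.
 * Assumption: cleanup also re-enables reconnection afterwards. */
static void toy_cleanup_and_start(struct toy_conn *c)
{
    c->n_saved = 0;
    c->reconnect_disabled = false;
}

/* Match a reply against the saved XIDs; false means "unknown XID". */
static bool toy_reply(struct toy_conn *c, uint32_t xid)
{
    for (size_t i = 0; i < c->n_saved; i++) {
        if (c->saved[i] == xid) {
            c->saved[i] = c->saved[--c->n_saved];
            return true;
        }
    }
    return false;
}

/* Replays the interleaving from the report. Returns true if the
 * handshake reply can still be mapped to a saved XID. */
bool toy_handshake_survives(bool disable_on_reconfig)
{
    struct toy_conn c = { .connected = true, .next_xid = 1 };
    uint32_t xid = 0;

    /* Reconfigure, then disconnect. */
    c.reconnect_disabled = disable_on_reconfig;
    c.connected = false;

    /* The timer fires in the window before cleanup runs. */
    bool early = toy_reconnect_timer(&c, &xid);

    toy_cleanup_and_start(&c);

    /* If the timer was held off, the reconnect happens after cleanup. */
    if (!early) {
        bool late = toy_reconnect_timer(&c, &xid);
        assert(late);
    }
    return toy_reply(&c, xid);
}
```

In this model toy_handshake_survives(false) loses the handshake reply
because cleanup wipes the XID saved by the early reconnect, while
toy_handshake_survives(true) keeps it because the reconnect only happens
after cleanup. It says nothing about why the gfapi based tests misbehave.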
> 
>     This seems to work fine for many tests, but it seems to be causing
>     some issues in gfapi-based tests. I'm still investigating this.
> 
>     Xavi
> 
> 
> 
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-devel
> 