On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> After having tried several things, it seems that it will be complex to
> solve these races. All attempts to fix them have caused failures in
> other connections. Since I have other work to do and it doesn't seem to
> be causing serious failures in production, I'll leave this for now and
> come back to it when I have more time.

Xavi, could you convert the findings into a bug report and post the
details there, so that it can be followed up (if not already done)?

>
> Xavi
>
> On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan@xxxxxxxxxx>
> wrote:
>
>     Hi all,
>
>     I've identified a race in the RPC layer that caused some spurious
>     disconnections and CHILD_DOWN notifications.
>
>     The problem happens when protocol/client reconfigures a connection
>     to move from glusterd to glusterfsd. This is done by calling
>     rpc_clnt_reconfig() followed by rpc_transport_disconnect().
>
>     This seems fine because client_rpc_notify() will call
>     rpc_clnt_cleanup_and_start() when the disconnect notification is
>     received. However, there's a problem.
>
>     Suppose that the disconnection notification has been executed and we
>     are just about to call rpc_clnt_cleanup_and_start(). If the
>     reconnection timer fires at this point, rpc_clnt_reconnect() will be
>     processed. This will cause the socket to be reconnected and a
>     connection notification will be processed. Then a handshake request
>     will be sent to the server.
>
>     However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs
>     are deleted. When we receive the answer to the handshake, we are
>     unable to map the XID, so the request fails. The handshake therefore
>     fails and the client is considered down, sending a CHILD_DOWN
>     notification to upper xlators.
>
>     As a result, some tests start processing while a brick is
>     unexpectedly down, which causes spurious test failures.
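For anyone following the thread, the interleaving described above can be replayed with a small toy model. This is only a sketch under stated assumptions: ToyRpcClient, saved_frames, and run() are hypothetical names, not GlusterFS code, and the CHILD_UP outcome for the non-racy ordering is an assumption added for contrast. The steps run in a fixed order rather than on real threads, so the result is deterministic.

```python
class ToyRpcClient:
    """Toy stand-in for an RPC client; not the real GlusterFS structures."""

    def __init__(self):
        self.next_xid = 1
        self.saved_frames = {}   # xid -> request name, awaiting a reply
        self.events = []         # notifications sent to upper layers

    def reconnect_timer_fired(self):
        # Models rpc_clnt_reconnect(): the socket reconnects and a
        # handshake request is sent, so its XID is recorded.
        xid = self.next_xid
        self.next_xid += 1
        self.saved_frames[xid] = "handshake"
        return xid

    def cleanup_and_start(self):
        # Models rpc_clnt_cleanup_and_start(): all sent XIDs are dropped.
        self.saved_frames.clear()

    def reply_received(self, xid):
        # A reply can only be matched if its XID is still recorded.
        if self.saved_frames.pop(xid, None) is None:
            self.events.append("CHILD_DOWN")
        else:
            self.events.append("CHILD_UP")  # assumed outcome on success


def run(order):
    """Apply the steps in the given order, then deliver the handshake reply."""
    client = ToyRpcClient()
    xid = None
    for step in order:
        if step == "timer":
            xid = client.reconnect_timer_fired()
        elif step == "cleanup":
            client.cleanup_and_start()
    client.reply_received(xid)
    return client.events


racy = run(["timer", "cleanup"])   # timer fires before cleanup continues
safe = run(["cleanup", "timer"])   # cleanup finishes before the timer
print(racy, safe)
```

In the racy ordering the handshake XID is wiped before its reply arrives, which is the spurious CHILD_DOWN case described in the quoted message.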
>
>     To solve the problem I've forced the rpc_clnt_reconfig() function to
>     disable the RPC connection using code similar to rpc_clnt_disable().
>     This prevents the background rpc_clnt_reconnect() timer from being
>     executed, avoiding the problem.
>
>     This seems to work fine for many tests, but it seems to be causing
>     some issues in gfapi-based tests. I'm still investigating this.
>
>     Xavi
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-devel
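The attempted fix quoted above (disable the connection during reconfiguration so the reconnect timer cannot run early) can be sketched the same way. This is a toy model, not the actual patch: the class, the disabled flag, and the choice to re-enable inside cleanup_and_start() are all assumptions made for illustration, and it does not capture whatever went wrong in the gfapi-based tests.

```python
class ToyRpcClient:
    """Toy stand-in for an RPC client; not the real GlusterFS structures."""

    def __init__(self):
        self.disabled = False    # hypothetical flag set while reconfiguring
        self.next_xid = 1
        self.saved_frames = {}   # xid -> request name, awaiting a reply

    def reconfig(self):
        # Models the changed rpc_clnt_reconfig(): disable the connection
        # before the transport is disconnected.
        self.disabled = True

    def reconnect_timer_fired(self):
        # Models rpc_clnt_reconnect(): do nothing while disabled.
        if self.disabled:
            return None
        return self.send_handshake()

    def cleanup_and_start(self):
        # Models rpc_clnt_cleanup_and_start(): drop stale XIDs, then
        # re-enable and reconnect (re-enabling here is an assumption).
        self.saved_frames.clear()
        self.disabled = False
        return self.send_handshake()

    def send_handshake(self):
        # Record the XID of the handshake request that was just sent.
        xid = self.next_xid
        self.next_xid += 1
        self.saved_frames[xid] = "handshake"
        return xid

    def reply_received(self, xid):
        # True if the reply's XID could be mapped to a pending request.
        return self.saved_frames.pop(xid, None) is not None


client = ToyRpcClient()
client.reconfig()
early_xid = client.reconnect_timer_fired()   # timer fires mid-disconnect
xid = client.cleanup_and_start()
handshake_ok = client.reply_received(xid)
print(early_xid, handshake_ok)
```

Because the early timer callback sends nothing, the only handshake in flight is the one sent after the cleanup, and its reply can still be matched.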