On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> After having tried several things, it seems that it will be complex to
> solve these races. All attempts to fix them have caused failures in
> other connections. Since I have other work to do and this doesn't seem
> to be causing serious failures in production, I'll leave it for now and
> come back to it when I have more time.
Xavi, could you convert the findings into a bug and post the details
there (if not already done), so that it can be followed up?
I've just created this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1541032
>
> Xavi
>
> On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan@xxxxxxxxxx
> <mailto:jahernan@xxxxxxxxxx>> wrote:
>
> Hi all,
>
> I've identified a race in the RPC layer that causes some spurious
> disconnections and CHILD_DOWN notifications.
>
> The problem happens when protocol/client reconfigures a connection
> to move from glusterd to glusterfsd. This is done by calling
> rpc_clnt_reconfig() followed by rpc_transport_disconnect().
>
> This seems fine because client_rpc_notify() will call
> rpc_clnt_cleanup_and_start() when the disconnect notification is
> received. However, there's a problem.
>
> Suppose that the disconnection notification has been executed and we
> are just about to call rpc_clnt_cleanup_and_start(). If at this
> point the reconnection timer is fired, rpc_clnt_reconnect() will be
> processed. This will cause the socket to be reconnected and a
> connection notification will be processed. Then a handshake request
> will be sent to the server.
>
> However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs
> are deleted. When we receive the answer to the handshake, we are
> unable to map its XID, so the request fails. The handshake therefore
> fails and the client is considered down, which sends a CHILD_DOWN
> notification to the upper xlators.
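
To make the interleaving above concrete, here is a minimal toy model of
it. This is only a sketch, not the real rpc-clnt code: ModelRpcClient,
saved_frames, events, racy_order() and expected_order() are names made up
for illustration, and the model assumes that a reply with an unknown XID
leads directly to CHILD_DOWN.

```python
class ModelRpcClient:
    """Toy model of an RPC client connection (not the real rpc_clnt)."""

    def __init__(self):
        self.next_xid = 1
        self.saved_frames = {}  # XID -> name of the request awaiting a reply
        self.events = []        # what the upper layers get to see

    def reconnect_timer_fired(self):
        # Models rpc_clnt_reconnect(): the socket reconnects and a
        # handshake request is sent, which registers a new XID.
        xid = self.next_xid
        self.next_xid += 1
        self.saved_frames[xid] = "handshake"
        return xid

    def cleanup_and_start(self):
        # Models rpc_clnt_cleanup_and_start(): every sent XID is dropped.
        self.saved_frames.clear()

    def reply_received(self, xid):
        # A reply can only be matched if its XID is still registered.
        request = self.saved_frames.pop(xid, None)
        if request is None:
            # Simplification: an unmatched handshake reply means the
            # handshake fails and the client is reported as down.
            self.events.append("CHILD_DOWN")
            return False
        self.events.append(request + " ok")
        return True


def racy_order():
    rpc = ModelRpcClient()
    xid = rpc.reconnect_timer_fired()  # timer fires before the cleanup...
    rpc.cleanup_and_start()            # ...which then forgets the handshake XID
    rpc.reply_received(xid)
    return rpc.events


def expected_order():
    rpc = ModelRpcClient()
    rpc.cleanup_and_start()            # cleanup completes first
    xid = rpc.reconnect_timer_fired()  # handshake is sent afterwards
    rpc.reply_received(xid)
    return rpc.events
```

In this model racy_order() ends with a CHILD_DOWN event, while
expected_order() completes the handshake. The only difference between the
two is whether the timer runs before or after the cleanup.
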
>
> In some tests, this causes operations to start while a brick is
> unexpectedly down, which produces spurious test failures.
>
> To solve the problem I've forced the rpc_clnt_reconfig() function to
> disable the RPC connection using code similar to rpc_clnt_disable().
> This prevents the background rpc_clnt_reconnect() timer from being
> executed, avoiding the problem.
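
The idea behind this change can be sketched with a small toy model. It is
not the actual patch: GuardedRpcClient, its disabled flag and
reconfigure_with_disable() are hypothetical names, and the model assumes
that the cleanup-and-start step is what re-enables the connection.

```python
class GuardedRpcClient:
    """Toy model of a connection whose reconnect timer can be disabled."""

    def __init__(self):
        self.disabled = False
        self.next_xid = 1
        self.saved_frames = {}  # XID -> name of the request awaiting a reply

    def reconfig(self):
        # Models the proposed change: reconfiguring also disables the
        # connection, so a pending reconnect timer becomes a no-op.
        self.disabled = True

    def reconnect_timer_fired(self):
        if self.disabled:
            return None  # timer ignored while the connection is disabled
        xid = self.next_xid
        self.next_xid += 1
        self.saved_frames[xid] = "handshake"
        return xid

    def cleanup_and_start(self):
        self.saved_frames.clear()
        self.disabled = False  # assumption of this model: cleanup re-enables

    def reply_received(self, xid):
        # True only if the reply's XID is still registered.
        return self.saved_frames.pop(xid, None) is not None


def reconfigure_with_disable():
    rpc = GuardedRpcClient()
    rpc.reconfig()
    stale = rpc.reconnect_timer_fired()  # fires too early: does nothing
    rpc.cleanup_and_start()
    xid = rpc.reconnect_timer_fired()    # normal reconnect after cleanup
    return stale, rpc.reply_received(xid)
```

Here the early timer sends no handshake, so the cleanup has no XID to lose,
and the handshake sent after the cleanup is matched normally.
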
>
> This seems to work fine for many tests, but it seems to be causing
> some issues in gfapi-based tests. I'm still investigating this.
>
> Xavi
>
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>