Re: Race in protocol/client and RPC

On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <srangana@xxxxxxxxxx> wrote:
On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> After having tried several things, it seems that it will be complex to
> solve these races. All attempts to fix them have caused failures in
> other connections. Since I have other work to do and it doesn't seem to
> be causing serious failures in production, I'll leave this for now and
> come back to it when I have more time.

Xavi, could you convert the findings into a bug and post the details
there, so that it can be followed up (if not already done)?

I've just created this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1541032


>
> Xavi
>
> On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan@xxxxxxxxxx
> <mailto:jahernan@xxxxxxxxxx>> wrote:
>
>     Hi all,
>
>     I've identified a race in the RPC layer that causes some spurious
>     disconnections and CHILD_DOWN notifications.
>
>     The problem happens when protocol/client reconfigures a connection
>     to move from glusterd to glusterfsd. This is done by calling
>     rpc_clnt_reconfig() followed by rpc_transport_disconnect().
>
>     This seems fine because client_rpc_notify() will call
>     rpc_clnt_cleanup_and_start() when the disconnect notification is
>     received. However, there's a problem.
>
>     Suppose that the disconnection notification has been executed and we
>     are just about to call rpc_clnt_cleanup_and_start(). If at this
>     point the reconnection timer is fired, rpc_clnt_reconnect() will be
>     processed. This will cause the socket to be reconnected and a
>     connection notification will be processed. Then a handshake request
>     will be sent to the server.
>
>     However, when rpc_clnt_cleanup_and_start() continues, all sent XIDs
>     are deleted. When we receive the answer to the handshake, we are
>     unable to map its XID, so the request fails. The handshake therefore
>     fails and the client is considered down, sending a CHILD_DOWN
>     notification to upper xlators.
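The ordering problem described above can be illustrated with a small toy model. This is only a sketch: ToyRpcClient, its methods and run() are hypothetical names invented for illustration, not GlusterFS code, and the interleaving is replayed sequentially rather than with real threads. It assumes only what the description says: sent requests are tracked by XID, and the cleanup step drops all of them.

```python
class ToyRpcClient:
    """Hypothetical stand-in for an RPC client; not GlusterFS code."""

    def __init__(self):
        self.next_xid = 1
        self.pending = {}  # XID -> name of the request awaiting a reply

    def send(self, name):
        # Record the request under a fresh XID, as if it had been sent.
        xid = self.next_xid
        self.next_xid += 1
        self.pending[xid] = name
        return xid

    def cleanup(self):
        # Models the cleanup step: every saved request is forgotten.
        self.pending.clear()

    def handle_reply(self, xid):
        # Return the matching request name, or None if the XID is unknown.
        return self.pending.pop(xid, None)


def run(timer_fires_before_cleanup):
    """Replay one ordering; return True if the handshake reply is matched."""
    client = ToyRpcClient()
    if timer_fires_before_cleanup:
        # Racy ordering: the reconnect sends the handshake first,
        # then the delayed cleanup wipes the pending table.
        xid = client.send("handshake")
        client.cleanup()
    else:
        # Expected ordering: cleanup first, then reconnect and handshake.
        client.cleanup()
        xid = client.send("handshake")
    return client.handle_reply(xid) is not None


if __name__ == "__main__":
    print("cleanup first:", run(False))
    print("timer first:  ", run(True))
```

In the racy ordering the reply arrives for an XID that is no longer tracked, which corresponds to the failed handshake described in the message.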
>
>     As a result, some tests start processing operations while a brick
>     is unexpectedly down, which causes spurious test failures.
>
>     To solve the problem I've forced the rpc_clnt_reconfig() function to
>     disable the RPC connection using code similar to rpc_clnt_disable().
>     This prevents the background rpc_clnt_reconnect() timer from being
>     executed, avoiding the problem.
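A rough sketch of the idea behind that fix, again as a toy model rather than the actual patch: ToyConnection and all of its members are hypothetical, and it assumes that disabling the connection amounts to setting a flag that the reconnect timer callback checks before doing anything.

```python
class ToyConnection:
    """Hypothetical model of a connection with a reconnect timer."""

    def __init__(self):
        self.disabled = False
        self.reconnects_from_timer = 0
        self.reconnects_from_cleanup = 0

    def reconfig(self, disable):
        # Assumption: the fix marks the connection disabled during reconfig.
        if disable:
            self.disabled = True

    def on_reconnect_timer(self):
        # The timer callback does nothing while the connection is disabled.
        if self.disabled:
            return False
        self.reconnects_from_timer += 1
        return True

    def cleanup_and_start(self):
        # Cleanup re-enables the connection and starts it exactly once.
        self.disabled = False
        self.reconnects_from_cleanup += 1


def simulate(disable_on_reconfig):
    """Timer fires between the disconnect and the cleanup step."""
    conn = ToyConnection()
    conn.reconfig(disable_on_reconfig)
    conn.on_reconnect_timer()   # the racing timer
    conn.cleanup_and_start()    # the delayed cleanup
    return conn.reconnects_from_timer, conn.reconnects_from_cleanup


if __name__ == "__main__":
    print("without disable:", simulate(False))
    print("with disable:   ", simulate(True))
```

With the flag set, the racing timer never reconnects and only the cleanup path starts the connection. The sketch says nothing about the gfapi test issue mentioned next.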
>
>     This seems to work fine for many tests, but it seems to be causing
>     some issues in gfapi-based tests. I'm still investigating this.
>
>     Xavi
>
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>

