Re: [PATCH] libceph: clear con->out_msg on Policy::stateful_server faults

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 2020-10-08 at 18:58 +0200, Ilya Dryomov wrote:
> con->out_msg must be cleared on Policy::stateful_server
> (!CEPH_MSG_CONNECT_LOSSY) faults.  Not doing so botches the
> reconnection attempt, because after writing the banner the
> messenger moves on to writing the data section of that message
> (either from where it got interrupted by the connection reset or
> from the beginning) instead of writing struct ceph_msg_connect.
> This results in a bizarre error message because the server
> sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
> ceph_msg_connect:
> 
>   libceph: mds0 (1)172.21.15.45:6828 socket error on write
>   ceph: mds0 reconnect start
>   libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
>   libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
>   libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch
> 
> AFAICT this bug goes back to the dawn of the kernel client.
> The reason it survived for so long is that only MDS sessions
> are stateful and only two MDS messages have a data section:
> CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
> and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
> The connection has to get reset precisely when such message
> is being sent -- in this case it was the former.
> 
> Cc: stable@xxxxxxxxxxxxxxx
> Link: https://tracker.ceph.com/issues/47723
> Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
> ---
>  net/ceph/messenger.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index e9e2763a255f..c1f1f85545c3 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2998,6 +2998,11 @@ static void con_fault(struct ceph_connection *con)
>  		ceph_msg_put(con->in_msg);
>  		con->in_msg = NULL;
>  	}
> +	if (con->out_msg) {
> +		BUG_ON(con->out_msg->con != con);
> +		ceph_msg_put(con->out_msg);
> +		con->out_msg = NULL;
> +	}
>  
>  	/* Requeue anything that hasn't been acked */
>  	list_splice_init(&con->out_sent, &con->out_queue);

Nice catch, Ilya.

It might be nice to make a common helper that both reset_connection and
con_fault can call to drop the in_msg/out_msg, but keeping this small
for a stable patch is reasonable. Maybe that can be done as part of the
msgr2 work?
 
Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux