Re: failed lossy con, dropping message

Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> · Thu, 13 Apr 2017 12:05:27 +0300

Hello Greg,

Thank you for the answer.
I'm still unsure about the term "lossy". What does it mean in this context? I can think of different variants:
1. The designer of the protocol considers the connection to be "lossy" from the start, so connection errors are handled in a higher layer. The layer that observes the failure of the connection just logs the event and lets the upper layer handle it. This would support your statement 'since it's a "lossy" connection we don't need to remember the message and resend it.'

2. A connection is not declared "lossy" as long as it is working properly. Once it has lost some packets or some error threshold is reached, we declare the connection to be lossy, inform the higher layer, and let it decide what to do next. Compared with point 1, the actions are quite similar, but the usage of "lossy" is different. In point 1 a connection is always "lossy", even if it is not actually losing any packets. In the second case the connection becomes "lossy" when errors appear, so "lossy" is a runtime state of the connection.
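
The difference between the two variants can be sketched in code. This is only an illustration under my own assumptions: the class names, fields and the error threshold are hypothetical and are not taken from Ceph's messenger code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PolicyConnection:
    """Variant 1 (hypothetical): 'lossy' is a policy fixed at creation."""
    lossy: bool
    failed: bool = False
    resend_queue: List[str] = field(default_factory=list)
    dropped: List[str] = field(default_factory=list)

    def submit(self, msg: str) -> str:
        if not self.failed:
            return "sent"
        if self.lossy:
            # Lossy policy: log and forget; a higher layer has to retry.
            self.dropped.append(msg)
            return "dropped"
        # Lossless policy: remember the message for resending later.
        self.resend_queue.append(msg)
        return "queued"


@dataclass
class ThresholdConnection:
    """Variant 2 (hypothetical): 'lossy' is a state reached at runtime."""
    error_threshold: int = 3
    errors: int = 0

    def record_error(self) -> None:
        self.errors += 1

    @property
    def lossy(self) -> bool:
        # The connection only counts as lossy once enough errors occurred.
        return self.errors >= self.error_threshold
```

In variant 1 a healthy connection created with lossy=True is already "lossy"; the flag only decides what happens to a message after a failure. In variant 2 the same word describes the observed health of the connection.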

Maybe both are wrong and the truth is a third variant ... :) This is what I would like to understand.

Kind regards,
Laszlo

On 13.04.2017 00:36, Gregory Farnum wrote:
> On Wed, Apr 12, 2017 at 3:00 AM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>> Hello,
>>
>> yesterday one of our compute nodes has recorded the following message for
>> one of the ceph connections:
>>
>> submit_message osd_op(client.28817736.0:690186
>> rbd_data.15c046b11ab57b7.00000000000000c4 [read 2097152~380928] 3.6f81364a
>> ack+read+known_if_redirected e3617) v5 remote, 10.12.68.71:6818/6623, failed
>> lossy con, dropping message
>
> A read message, sent to the OSD at IP 10.12.68.71:6818/6623, is being
> dropped because the connection has somehow failed; since it's a
> "lossy" connection we don't need to remember the message and resend
> it. That failure could be an actual TCP/IP stack error; it could be
> because a different thread killed the connection and it's now closed.
>
> If you've just got one of these and didn't see other problems, it's
> innocuous; I expect the most common cause for this is an OSD getting
> marked down while IO is pending to it. :)
> -Greg
>
>> Can someone "decode" the above message, or direct me to some document where
>> I could read more about it?
>>
>> We have ceph 0.94.10.
>>
>> Thank you,
>> Laszlo
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
