Re: failed lossy con, dropping message

On Thu, Apr 13, 2017 at 2:17 AM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
Hello Greg,

Thank you for the answer.
I'm still in doubt about "lossy". What does it mean in this context? I can think of different variants:
1. The designer of the protocol considers the connection to be "lossy" from the start, so connection errors are handled in a higher layer. The layer that observed the failure of the connection just logs the event and lets the upper layer handle it. This would support your statement 'since it's a "lossy" connection we don't need to remember the message and resend it.'

This one. :)
The messenger subsystem can be configured as lossy or non-lossy; all the RADOS connections are lossy, since a failure frequently means we'll have to retarget the operation anyway (to a different OSD). CephFS uses the stateful connections a bit more.
-Greg
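
To make the distinction concrete, here is a minimal self-contained sketch of a per-peer connection policy. This is not Ceph's actual messenger code; every name below is invented for illustration. The point it shows is that "lossy" is decided up front, per peer type, when the messenger is configured (Laszlo's variant 1), not flipped on at runtime after errors appear (variant 2):

    #include <iostream>

    // Illustrative sketch only -- a simplified stand-in for a messenger
    // connection policy; none of these names come from the Ceph source.
    struct ConnectionPolicy {
        bool lossy;  // true: on failure, drop pending messages and let a
                     // higher layer decide whether (and where) to resend

        // RADOS-style client connection to an OSD: a failure usually means
        // the op must be retargeted to a different OSD anyway, so the
        // messenger itself keeps no resend state.
        static ConnectionPolicy lossy_client() { return {true}; }

        // Stateful connection (e.g. a CephFS client to an MDS): the
        // messenger must replay undelivered messages after reconnect.
        static ConnectionPolicy lossless_client() { return {false}; }
    };

    int main() {
        ConnectionPolicy to_osd = ConnectionPolicy::lossy_client();
        ConnectionPolicy to_mds = ConnectionPolicy::lossless_client();
        std::cout << "osd connection lossy: " << to_osd.lossy << "\n"
                  << "mds connection lossy: " << to_mds.lossy << "\n";
    }

In this model the flag is fixed when the connection type is set up and never changes afterwards, so a "lossy" connection that never drops a single packet is still lossy in this sense.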




2. A connection is not declared "lossy" as long as it is working properly. Once it has lost some packets or some error threshold is reached, we declare the connection lossy, inform the higher layer, and let it decide what to do next. Compared with point 1 the actions are quite similar, but the usage of "lossy" is different. In point 1 a connection is always "lossy" even if it is not actually losing any packets. In the second case the connection becomes "lossy" when errors appear, so "lossy" is a runtime state of the connection.

Maybe both are wrong and the truth is a third variant ... :) This is what I would like to understand.

Kind regards,
Laszlo


On 13.04.2017 00:36, Gregory Farnum wrote:
> On Wed, Apr 12, 2017 at 3:00 AM, Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
>> Hello,
>>
>> yesterday one of our compute nodes has recorded the following message for
>> one of the ceph connections:
>>
>> submit_message osd_op(client.28817736.0:690186
>> rbd_data.15c046b11ab57b7.00000000000000c4 [read 2097152~380928] 3.6f81364a
>> ack+read+known_if_redirected e3617) v5 remote, 10.12.68.71:6818/6623, failed
>> lossy con, dropping message
>
> A read message, sent to the OSD at IP 10.12.68.71:6818/6623, is being
> dropped because the connection has somehow failed; since it's a
> "lossy" connection we don't need to remember the message and resend
> it. That failure could be an actual TCP/IP stack error; it could be
> because a different thread killed the connection and it's now closed.
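
To illustrate the quoted explanation, here is a minimal sketch of the branch such a messenger might take when a send fails, which is where a line like "failed lossy con, dropping message" would come from. Again, the names and types are invented for this sketch and are not Ceph source:

    #include <deque>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <utility>

    struct Message { std::string description; };

    struct Connection {
        bool lossy;                                  // fixed at setup, per peer type
        std::deque<std::unique_ptr<Message>> out_q;  // replay queue (lossless only)

        void on_send_failure(std::unique_ptr<Message> m) {
            if (lossy) {
                std::cout << "failed lossy con, dropping message: "
                          << m->description << "\n";
                // m is destroyed here; nothing remembers it. If the caller
                // still needs the result, a higher layer resends the op,
                // typically to a different OSD once a new osdmap arrives.
            } else {
                // Lossless/stateful: queue for replay after reconnect, so
                // delivery and ordering are preserved by the messenger itself.
                out_q.push_back(std::move(m));
            }
        }
    };

    int main() {
        Connection to_osd{true, {}};
        to_osd.on_send_failure(
            std::make_unique<Message>(Message{"osd_op read 2097152~380928"}));
    }

Under this model, a single dropped read like the one in the original log is harmless as long as the client's op-tracking layer notices the failure and resends.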
>
> If you've just got one of these and didn't see other problems, it's
> innocuous — I expect the most common cause for this is an OSD getting
> marked down while IO is pending to it. :)
> -Greg
>
>>
>> Can someone "decode" the above message, or direct me to some document where
>> I could read more about it?
>>
>> We have ceph 0.94.10.
>>
>> Thank you,
>> Laszlo
