Re: MSG_CONFIRM RX messages with SocketCAN known as unreliable under heavy load?

On 17.06.2021 14:22:03, Harald Mommer wrote:
> we are currently in the process of developing a draft specification for
> Virtio CAN. In the scope of this work I am developing a Virtio CAN Linux
> driver and a Virtio CAN Linux device

Oh that sounds interesting. Please keep the linux-can mailing list in
the loop. Do you have a first draft version for review, yet?

> running on top of our hypervisor solution.
> 
> The Virtio CAN Linux device forwards an existing SocketCAN CAN device
> (currently vcan) via Virtio to the Virtio driver guest so that the virtual
> driver guest can send and receive CAN frames via SocketCAN.
> 
> What was originally planned (probably with too much AUTOSAR CAN driver
> semantics in my head and too little SocketCAN knowledge) is to mark a
> transmission request as used (done) when it's finally sent on the CAN bus
> (vs. when it's given to SocketCAN, not really done but still pending
> somewhere in the protocol stack).

Makes sense.

> Thought this was doable with some implementation effort using
> 
> setsockopt(..., SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, ...) and evaluating the
> MSG_CONFIRM bit on received messages.

Where does that code run? Would that be part of qemu running on the host
of an open source solution?

Can you sketch a quick block diagram showing guest, host, Virtio device,
Virtio driver, etc...
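
For readers of the archive, the approach quoted above can be sketched
roughly as follows. This is only an illustration under assumptions, not
code from this thread: the helper names enable_recv_own_msgs() and
recv_frame() are made up for the example, and `sock` is assumed to be an
already bound CAN_RAW socket with the default loopback setting.

```c
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/can.h>
#include <linux/can/raw.h>

/* Hypothetical helper: ask the CAN_RAW socket to also deliver the frames
 * it sent itself. Returns 0 on success, -1 on error (errno is set). */
int enable_recv_own_msgs(int sock)
{
    const int on = 1;

    return setsockopt(sock, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
                      &on, sizeof(on));
}

/* Hypothetical helper: read one classic CAN frame with recvmsg().
 * Returns  1 if the frame is the echo of a frame sent on this socket
 *            (MSG_CONFIRM set in msg_flags),
 *          0 if it is an ordinary received frame,
 *         -1 on error or short read. */
int recv_frame(int sock, struct can_frame *frame)
{
    struct iovec iov = { .iov_base = frame, .iov_len = sizeof(*frame) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    ssize_t n = recvmsg(sock, &msg, 0);

    if (n != (ssize_t)sizeof(*frame))
        return -1;

    return (msg.msg_flags & MSG_CONFIRM) ? 1 : 0;
}
```

A caller would treat a return value of 1 as the TX confirmation for a
pending request. The flags have to be read from msg_flags after
recvmsg(); a plain read() does not expose them.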

> This works fine with
> 
> cangen -g 0 -i can0
> 
> on the driver side sending CAN messages to the device guest. No confirmation
> is lost testing for several minutes.

Where's the driver side? On the host or the guest?

> Adding now on the device side a
> 
> cangen -g 0 -i vcan0
> 
> sending messages like crazy from the device side guest to the driver side
> guest in parallel I'm losing TX confirmations in the Linux CAN stack. Seems
> also there is no other error indication (CAN_ERR_FLAG) that something like

CAN_ERR_FLAG frames are only generated for real CAN errors on the bus or
for controller problems. The vcan interface doesn't generate any.

> this happened. The virtio CAN device gets out of resources and TX will
> become stuck. Which is not really acceptable even for such a heavy load
> situation (-g0 on both sides).
> 
> Is CAN_RAW_RECV_OWN_MSGS / MSG_CONFIRM known as being unreliable (meaning
> MSG_CONFIRM messages are dropped) under extreme load situations? If so, is
> there a way to detect reliably that this happened so that somehow a recovery
> mechanism for the pending TX acknowledgements could be implemented?

Have you activated SO_RXQ_OVFL?
With it set, recvmsg() reports the number of messages dropped on the
socket in an ancillary (cmsg) message.
Have a look at:
https://github.com/linux-can/can-utils/blob/master/cansequence.c

> I'm aware that "normal" RX messages from other nodes may be dropped due to
> overload. No problem with this.
> 
> The timing requirement originally set (done when sent on CAN bus) has to be
> weakened or put under a feature flag when it's not reliably implementable in
> all environments.

Even if the Linux kernel doesn't drop any messages, not all CAN
controllers support that feature. On the Linux side we try our best, but
some USB attached devices don't report a TX complete event back, so the
driver triggers the CAN echo skb after the USB transfer has been
completed.

We don't have a feature flag to query whether the Linux driver supports
a proper CAN echo on TX complete notification.

> But before declaring as "not reliably implementable with
> Linux SocketCAN" I would like to be sure that it's really that way and
> absolutely nothing can be done about it. Could even be that I missed an
> additional setting I'm not aware of. But the observed behavior may as well
> be something which is known to everyone except me.
> 
> Of course it can be that there is still a bug in my software but I checked
> this carefully and I'm now convinced that under heavy load situations
> MSG_CONFIRM messages are lost somewhere in the Linux SocketCAN protocol
> stack. If there's no way to recover from this situation I have to weaken the
> next Virtio CAN draft specification regarding the TX ACK timing. As
> this has some additional impact on the specification before doing so I would
> like to be really sure that the TX ACK timing cannot be done reliably the
> way it was originally planned.

Do you have some code available yet?

regards,
Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |
