Hello,
we are currently in the process of developing a draft specification for
Virtio CAN. In the scope of this work I am developing a Virtio CAN Linux
driver and a Virtio CAN Linux device running on top of our hypervisor
solution.
The Virtio CAN Linux device forwards an existing SocketCAN CAN device
(currently vcan) via Virtio to the Virtio driver guest so that the
virtual driver guest can send and receive CAN frames via SocketCAN.
What was originally planned (probably with too much AUTOSAR CAN driver
semantics in my head and too little SocketCAN knowledge) is to mark a
transmission request as used (done) only when the frame has actually
been sent on the CAN bus, as opposed to when it has been handed to
SocketCAN, where it is not really done but still pending somewhere in
the protocol stack.
I thought this was doable with some implementation effort using
setsockopt(..., SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, ...) and evaluating
the MSG_CONFIRM bit on received messages.
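To make the approach concrete, here is a minimal C sketch of the idea.
It is not the actual device code: the helper names
enable_recv_own_msgs(), is_tx_confirmation() and recv_one() are made up
for illustration, and the socket is assumed to be an already opened and
bound CAN_RAW socket. The msg_flags are only available via recvmsg(),
not via read().

```c
#include <assert.h>
#include <linux/can.h>
#include <linux/can/raw.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ask a CAN_RAW socket to also deliver the frames sent through it. */
int enable_recv_own_msgs(int sock)
{
    int on = 1;

    return setsockopt(sock, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
                      &on, sizeof(on));
}

/* Nonzero if the recvmsg() flags mark a frame sent via this socket. */
int is_tx_confirmation(int msg_flags)
{
    return (msg_flags & MSG_CONFIRM) != 0;
}

/*
 * Receive one classic CAN frame (hypothetical helper).
 * Returns 1 for a TX confirmation, 0 for a frame from another sender
 * and -1 on error or short read.
 */
int recv_one(int sock, struct can_frame *frame)
{
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    assert(frame != NULL);

    iov.iov_base = frame;
    iov.iov_len = sizeof(*frame);

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(sock, &msg, 0);
    if (n != (ssize_t)sizeof(*frame))
        return -1;

    return is_tx_confirmation(msg.msg_flags);
}
```

In this sketch a return value of 1 from recv_one() would be the point
where the corresponding Virtio TX request is marked as used.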
This works fine with
cangen -g 0 -i can0
on the driver side sending CAN messages to the device guest. No
confirmation is lost when testing for several minutes.
When I now additionally run
cangen -g 0 -i vcan0
on the device side, sending messages like crazy from the device side
guest to the driver side guest in parallel, I'm losing TX confirmations
in the Linux CAN stack. There also seems to be no other error indication
(CAN_ERR_FLAG) that something like this happened. The Virtio CAN device
runs out of resources and TX becomes stuck, which is not really
acceptable even for such a heavy load situation (-g 0 on both sides).
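To illustrate why a lost confirmation ends in a stuck TX path, here is a
simplified C sketch of the bookkeeping problem. It is an assumption made
for illustration only and not the real device implementation: the slot
count PENDING_SLOTS, the struct pending_tx and all function names are
hypothetical. Each request handed to SocketCAN occupies a slot until its
MSG_CONFIRM echo arrives; a slot whose echo never arrives is never
released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PENDING_SLOTS 4 /* hypothetical number of in-flight TX requests */

struct pending_tx {
    bool in_use[PENDING_SLOTS];
};

/* Claim a slot when a frame is handed to SocketCAN; -1 if none is free. */
int pending_claim(struct pending_tx *p)
{
    for (int i = 0; i < PENDING_SLOTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1; /* all slots wait for a confirmation: TX is stuck */
}

/* Release a slot when the MSG_CONFIRM echo for it has been received. */
void pending_confirm(struct pending_tx *p, int slot)
{
    assert(slot >= 0 && slot < PENDING_SLOTS);
    p->in_use[slot] = false;
}

/* Number of requests still waiting for their confirmation. */
int pending_in_flight(const struct pending_tx *p)
{
    int n = 0;

    for (int i = 0; i < PENDING_SLOTS; i++)
        n += p->in_use[i] ? 1 : 0;
    return n;
}
```

Once every slot is held by a request whose confirmation was dropped,
pending_claim() keeps returning -1 and nothing can be sent any more
unless there is some way to detect the loss and release the slots.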
Is CAN_RAW_RECV_OWN_MSGS / MSG_CONFIRM known to be unreliable (meaning
MSG_CONFIRM messages are dropped) under extreme load situations? If so,
is there a way to detect reliably that this happened so that some
recovery mechanism for the pending TX acknowledgements could be
implemented?
I'm aware that "normal" RX messages from other nodes may be dropped due
to overload. No problem with this.
The timing requirement originally set (done when sent on CAN bus) has to
be weakened or put under a feature flag if it's not reliably
implementable in all environments. But before declaring it "not reliably
implementable with Linux SocketCAN" I would like to be sure that it's
really that way and absolutely nothing can be done about it. Could even
be that I missed an additional setting I'm not aware of. But the
observed behavior may as well be something which is known to everyone
except me.
Of course there may still be a bug in my software, but I have checked
this carefully and I'm now convinced that under heavy load situations
MSG_CONFIRM messages are lost somewhere in the Linux SocketCAN protocol
stack. If there's no way to recover from this situation I have to weaken
the next draft of the Virtio CAN specification regarding the TX ACK
timing. As this has some additional impact on the specification, I would
like to be really sure before doing so that the TX ACK timing cannot be
done reliably the way it was originally planned.
Regards
Harald
--
Dipl.-Ing. Harald Mommer
Senior Software Engineer
OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin
Phone: +49 (30) 60 98 540-0 <== switchboard
Fax: +49 (30) 60 98 540-99
E-Mail: harald.mommer@xxxxxxxxxxxxxxx
www.opensynergy.com
Handelsregister: Amtsgericht Charlottenburg, HRB 108616B
Geschäftsführer/Managing Director: Regis Adjamah