Re: MSG_CONFIRM RX messages with SocketCAN known as unreliable under heavy load?

Harald Mommer <hmo@xxxxxxxxxxxxxxx> · Thu, 24 Jun 2021 17:21:15 +0200

Hello,

On 18.06.21 at 11:16, Marc Kleine-Budde wrote:
> On 17.06.2021 14:22:03, Harald Mommer wrote:
>> we are currently in the process of developing a draft specification for
>> Virtio CAN. In the scope of this work I am developing a Virtio CAN Linux
>> driver and a Virtio CAN Linux device
> Oh that sounds interesting. Please keep the linux-can mailing list in
> the loop. Do you have a first draft version for review, yet?

The first draft went to virtio-comment@xxxxxxxxxxxxxxxxxxxx and
virtio-dev@xxxxxxxxxxxxxxxxxxxx.

https://markmail.org/search/?q=virtio-can&q=list%3Aorg.oasis-open.lists.virtio-comment#query:virtio-can%20list%3Aorg.oasis-open.lists.virtio-comment+page:1+mid:hdxj35fsthypllkt+state:results

The link should reveal the short conversation. I am currently working on
the next draft, which incorporates the review comments I have received so
far and will also address the "TX ACK" problem we are discussing here.

In the future I will keep the linux-can list in the loop.

>> running on top of our hypervisor solution.
>>
>> The Virtio CAN Linux device forwards an existing SocketCAN CAN device
>> (currently vcan) via Virtio to the Virtio driver guest so that the virtual
>> driver guest can send and receive CAN frames via SocketCAN.
>>
>> What was originally planned (probably with too much AUTOSAR CAN driver
>> semantics in my head and too little SocketCAN knowledge) is to mark a
>> transmission request as used (done) when it's finally sent on the CAN bus
>> (vs. when it's given to SocketCAN, not really done but still pending
>> somewhere in the protocol stack).
> Makes sense.

I read the "Makes sense". But after also reading the rest of the e-mail
(and the thread), I get the impression that making this timing
requirement mandatory when using SocketCAN is asking for trouble.

- Could remove the timing requirement. This is the easy solution. But
there is the "Makes sense".

- The original strict timing requirement becomes an option, so it is no
longer a mandatory requirement.

The second is my favorite (but I tend to over-engineer on the first
attempt, so the first option may indeed be the better one).

Dropping this timing behavior implies that some other things in the next
Virtio draft spec have to be changed, which now means simplified.

>> Thought this was doable with some implementation effort using
>>
>> setsockopt(..., SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, ...) and evaluating the
>> MSG_CONFIRM bit on received messages.
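
For illustration, the approach quoted above could look roughly like the
following sketch. It is not the code of the device application discussed
in this thread; the function names enable_own_msg_echo(),
is_tx_confirmation() and recv_frame() are made up for this example, and
it assumes a CAN_RAW socket that is already bound to the interface and
carries classic CAN frames only.

```c
#include <assert.h>
#include <linux/can.h>
#include <linux/can/raw.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Ask a CAN_RAW socket to also deliver the frames sent on this very socket. */
int enable_own_msg_echo(int sock)
{
    int on = 1;

    return setsockopt(sock, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
                      &on, sizeof(on));
}

/* True if recvmsg() flagged the frame as an echo of our own transmission. */
bool is_tx_confirmation(const struct msghdr *msg)
{
    assert(msg != NULL);
    return (msg->msg_flags & MSG_CONFIRM) != 0;
}

/*
 * Read one classic CAN frame (hypothetical helper).
 * Returns 1 for a TX confirmation, 0 for an ordinary RX frame, -1 on error.
 */
int recv_frame(int sock, struct can_frame *frame)
{
    struct iovec iov = { .iov_base = frame, .iov_len = sizeof(*frame) };
    struct msghdr msg;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(sock, &msg, 0);
    if (n != (ssize_t)sizeof(*frame))
        return -1;

    return is_tx_confirmation(&msg) ? 1 : 0;
}
```

Note that MSG_CONFIRM only shows up in msg_flags when recvmsg() is used;
plain read() or recv() does not return the flags.
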
> Where does that code run? Would that be part of qemu running on the host
> of an open source solution?

The device application is closed source and runs under the COQOS
hypervisor, which is also closed source. A qemu device implementation is
not planned as of now. The Virtio CAN driver is a Linux device driver and
will be open sourced at some point, in the hope of getting it upstreamed
in the more distant future. Currently the driver lives on an internal
development branch where outsiders cannot see it (still better for
everyone), and colleagues are reviewing it and helping to bring it into
an acceptable shape.

> Can you sketch a quick block diagram showing guest, host, Virtio device,
> Virtio driver, etc...

I hope this arrives on the list as it was sent and not garbled:

     Guest 2                    | Guest3
----------------                | ----------------
! cangen,      !                | ! cangen,      !
! candump,     !                | ! candump,     !
! cansend      !                | ! cansend      !
! using vcan0  !                | ! using can0   !
----------------                | ----------------
 ^                              |             ^
 !  ---------------------       |             !
 !  ! Service process   !       |             !
 !  ! in user space     !       |             !
 !  ! virtio-can device !       |             !
 !  ! forwarding vcan0  !       |             !
 !  ---------------------       |             !
 !    ^               ^         |             !
 !    !               !         |             !
--------------------------------------------------
 !    !   Device side ! kernel  | Driver side ! kernel
 v    v               v         |             v
---------------- -------------- | ----------------
! Device Linux ! ! HV support ! | ! Driver Linux !
!    VCan      ! !   module   ! | !  Virtio CAN  !
!    vcan0     ! ! on device  ! | !     can0     !
!              ! !   side     ! | !              !
---------------- -------------- | ----------------
       ^               ^        |        ^
       !               !        |        !
--------------------------------------------------
       !               !                 ! Hypervisor
       v               v                 v
--------------------------------------------------
!                     COQOS-HV                   !
--------------------------------------------------

>> This works fine with
>>
>> cangen -g 0 -i can0
>>
>> on the driver side sending CAN messages to the device guest. No confirmation
>> is lost testing for several minutes.
> Where's the driver side? On the host or the guest?

Both sides are guests of the hypervisor in our architecture. There is no
host in this sense; COQOS-HV is a type 1 hypervisor. The hypervisor does
not provide devices directly on its own. The devices are provided with
the support of a device (provider) guest, which is also only a guest of
the hypervisor.

> Have you activated SO_RXQ_OVFL?
> With recvmsg() you get the number of dropped messages in the socket.
> Have a look at:
> https://github.com/linux-can/can-utils/blob/master/cansequence.c

I had no idea about SO_RXQ_OVFL. This looks useful for implementing an
emergency recovery mechanism to avoid getting stuck: if loss of received
frames is detected, the controller is still active and TX messages have
been pending for too long, the pending TX messages are marked as used
(done) to cope with the situation and not stay stuck (for too long).
This might be acceptable if it is something which normally does not
happen except in really exceptional situations.
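
The decision described above could be condensed into something like the
following sketch. Everything in it is an assumption made for
illustration: the function name should_force_tx_done(), its parameters,
and the idea of comparing two consecutive drop counter values. It only
restates the three conditions from the paragraph and is not part of any
existing implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether pending TX requests should be marked as used (done)
 * without waiting any longer for their confirmation.
 *
 * prev_drops / cur_drops: two consecutive cumulative RX drop counts
 * controller_active:      CAN controller is still running
 * pending_tx:             number of TX requests still awaiting confirmation
 * oldest_pending_ms:      age of the oldest pending TX request
 * timeout_ms:             how long a TX request may stay pending
 */
bool should_force_tx_done(uint32_t prev_drops, uint32_t cur_drops,
                          bool controller_active, unsigned int pending_tx,
                          uint64_t oldest_pending_ms, uint64_t timeout_ms)
{
    bool rx_loss;

    assert(timeout_ms > 0);

    /* Any change of the counter means received frames were dropped. */
    rx_loss = (cur_drops != prev_drops);

    return rx_loss && controller_active && pending_tx > 0 &&
           oldest_pending_ms > timeout_ms;
}
```
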

This is not something to do now; it is getting far too complicated for a
first shot at implementing a Virtio CAN device.

> We don't have a feature flag to query if the Linux driver supports proper
> CAN echo on TX complete notification.

Not so nice. But the device integrator should know which backend is used,
and with a command line option for the device application the issue can
be handled. I need the command line switch now anyway to do experiments.

Regards
Harald

--
Dipl.-Ing. Harald Mommer
Senior Software Engineer

OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin

Phone:  +49 (30) 60 98 540-0 <== switchboard
Fax:    +49 (30) 60 98 540-99
E-Mail: harald.mommer@xxxxxxxxxxxxxxx

www.opensynergy.com

Commercial register: Amtsgericht Charlottenburg, HRB 108616B
Managing Director: Regis Adjamah