Re: j1939: control messages and PGN

David Jander <david@xxxxxxxxxxx> · Wed, 29 May 2019 09:17:10 +0200

On Wed, 29 May 2019 07:04:42 +0200
Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx> wrote:

> On Tue, May 28, 2019 at 04:48:03PM +0200, Kurt Van Dijck wrote:
> > On di, 28 mei 2019 16:27:57 +0200, David Jander wrote:  
> > > On Tue, 28 May 2019 15:13:44 +0200
> > > Kurt Van Dijck <dev.kurt@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> > >   
> > > > On di, 28 mei 2019 13:54:35 +0200, Oleksij Rempel wrote:  
> > > > > Hi all,
> > > > > 
> > > > > when receiving j1939 control messages, the current code looks up the
> > > > > session by DA and SA only, not taking the PGN (which is part of the
> > > > > control messages' data) into account.
> > > > > 
> > > > > When it comes to error control messages the session is aborted, even if
> > > > > the PGN doesn't match. In EOMA the session is aborted, too. This means
> > > > > > receiving control messages with non-matching PGNs leads to a session abort.
> > > > > 
> > > > > Is this in general a good behavior?  
> > > 
> > > Tl;DR: Yes, if PGN does not match the (E)TP session should be aborted.
> > >   
> > > > It is indeed a bit stupid.
> > > > 
> > > > If 2 ends are talking to each other, and 1 of those 2 talks about
> > > > something else, that implies that you talk not about the same thing
> > > > either, and you probably want to abort soon. It would be better if
> > > > you only abort 'probably soon' and not 'immediately' in such case, since
> > > > you're right that reception of another PGN control frame does not imply
> > > > > that your current session became invalid.  
> > > 
> > > If the session is with me and I see conflicting PGN (not start of a new
> > > session), why not send an abort immediately?
> > >   
> > > > In j1939 however, the data part does not carry PGN info, since only 1
> > > > session can be open. This implies 2 things:
> > > > * Ignoring PGN difference in control frames makes you blind to the data
> > > >   consistency, so you may think in certain cases that you continue to receive
> > > >   data while it's actually the data that belongs to another PGN.  
> > > 
> > > The sender cannot start sending data without the receiver acknowledging that
> > > with a CTS message first. If the CTS contains a different PGN than the RTS,
> > > then the sender should abort immediately.
> > > For ETP, if the receiver sees a DPO frame with a different PGN, it should also
> > > send an abort for that PGN immediately.
> > >   
> > > > > * If node A talks to node B on a different PGN than the one node B thinks
> > > > >   node A is talking about, then this is AFAIK considered a protocol
> > > > >   violation because you risk data corruption.
> > > > 
> > > > The PGN in all-but-1st control frame could be considered redundant, but
> > > > since it's there, it should match.  
> > > 
> > > Ack.
> > >   
> > > > So, it's still not a good behaviour, but j1939 IMHO requires you to do so.
> > > > 
> > > > So you think this is bad, let's make it even worse :-)
> > > > Between 2 nodes, actually 2 sessions may exist, 1 recv & 1 send.  
> > > 
> > > > Actually four in this case: ETPrx, TPrx, ETPtx and TPtx, right?  
> > 
> > Right, I tend to forget ETP for simplicity :-)  
> > >   
> > > > > Still, control frames such as RTS, CTS, DPO, ... are uni-directional,
> > > > i.e. they map to only 1 of those 2 sessions exclusively.
> > > > This is not the case for an abort message.
> > > > _If I'm not mistaken_, the PGN info should be ignored for abort frames,
> > > > > since it may be unclear what exactly you abort: an old PGN, or a newly
> > > > requested PGN. And due to that, it's also unclear if it applies to the
> > > > send or recv path, so you abort, AFAIK, both directions at once.
> > > > But I have not the specifications around now, I can't verify.  
> > > 
> > > > The abort message also contains the PGN of the packetized message, so AFAICS, you
> > > > can abort any specific one of the 4 theoretically simultaneous sessions,
> > > > because they should have different PGNs for the different directions (rx/tx).
> > > 
> > > That's probably one of the reasons why there is always a different PGN used to
> > > talk in one direction than in the other.  
> > I see.
> > I did not implement that very nicely, I think.  
> 
> Let's take some example to make me better understand all possible
> scenarios.
> We have node 0x80 and node 0x90:
> - 0x80 is transmitting data to the 0x90 with RTS PGN 0x12300
> - if 0x90 gets a control signal from 0x80 (DPO) with PGN 0x13300, 0x90 should
>   send an abort message for PGN 0x13300 and cancel the currently running
>   0x80->0x90,0x12300 session.

I think that's ok. Unsure if canceling 0x12300 is the right thing to do though.
If the DPO came from a confused 3rd ECU, then the session from the _real_ 0x80
would still have a chance of success. Why cancel it? If 0x80 is
off-the-rails, the session will otherwise just time out and abort anyway, right?

Btw, the "reason" (byte 2 (index 1)) for this abort message would be:
"10: Unexpected EDPO PGN (PGN in EDPO is bad)"
(source: ISO/DIS 11783‐3:2017(E) page 42, chapter 5.11.4.6, table 5.9)
Note that this is specified for ETP only. For TP, the abort reason is not
defined in this case. I think we should use the same as for ETP.

> - if 0x80 gets a control signal from 0x90 (CTS, EOMA) with PGN 0x13300, 0x80 should
>   send an abort message to 0x90 for PGN 0x13300 and cancel the currently running
>   0x80->0x90,0x12300 session.

Again, unsure if canceling 0x12300 is necessary/desired. The abort message is
correct IMO though.

"Reason" for bad PGN in ECTS: 14
"Reason" for bad PGN in EOMA: not specified (==> 250)?
(again, only defined for ETP in this version of the standard).
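
A minimal sketch of how the reason selection and the abort payload could look.
All names here are hypothetical and not taken from the stack. The reason values
are the ones quoted above; using 250 for EOMA is only my tentative suggestion.
The payload layout (control byte 255, reason at index 1, PGN in the last three
bytes, LSB first) is how I read the standard.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum bad_pgn_source {
	BAD_PGN_IN_EDPO,
	BAD_PGN_IN_ECTS,
	BAD_PGN_IN_EOMA,
};

#define CM_ABORT			0xff	/* control byte of an abort message */
#define ABORT_UNEXPECTED_EDPO_PGN	10	/* "Unexpected EDPO PGN" */
#define ABORT_UNEXPECTED_ECTS_PGN	14	/* bad PGN in ECTS */
#define ABORT_OTHER			250	/* assumption: fallback when unspecified */

/* Pick the abort reason for a control frame that carried a bad PGN. */
static uint8_t abort_reason_for(enum bad_pgn_source src)
{
	switch (src) {
	case BAD_PGN_IN_EDPO:
		return ABORT_UNEXPECTED_EDPO_PGN;
	case BAD_PGN_IN_ECTS:
		return ABORT_UNEXPECTED_ECTS_PGN;
	default:
		return ABORT_OTHER;
	}
}

/* Fill the 8 data bytes of an abort message for the given PGN. */
static void build_abort(uint8_t dat[8], uint8_t reason, uint32_t pgn)
{
	memset(dat, 0xff, 8);		/* reserved bytes stay 0xff */
	dat[0] = CM_ABORT;
	dat[1] = reason;		/* byte 2 (index 1) */
	dat[5] = pgn & 0xff;		/* PGN, LSB first */
	dat[6] = (pgn >> 8) & 0xff;
	dat[7] = (pgn >> 16) & 0xff;
}
```

For the DPO example above, build_abort(dat, abort_reason_for(BAD_PGN_IN_EDPO),
0x13300) would put reason 10 at index 1 and the PGN of the offending frame, not
the PGN of the running session, in the last three bytes.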

> - if 0x80 and 0x90 get an abort signal for 0x80->0x90,0x13300, which
>   was sent by 0x80 or an evil third ECU, the currently running 0x80->0x90,0x12300
>   session should not be aborted.

Ack.
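
To make the three cases concrete, here is a rough sketch of the decision I have
in mind. It is not how the current code works. The names are hypothetical, the
session is assumed to have been looked up by SA/DA already, and the frame is
assumed not to be an RTS that starts a new session. It follows my preference
above of keeping the running session alive when the PGN does not match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum cm_action {
	CM_PROCESS,			/* PGN matches: handle normally */
	CM_IGNORE,			/* abort for another PGN: not ours */
	CM_SEND_ABORT_KEEP_SESSION,	/* bad PGN: abort that PGN only */
};

struct session {
	uint8_t sa;	/* source address of the sender */
	uint8_t da;	/* destination address of the receiver */
	uint32_t pgn;	/* PGN announced in the RTS */
};

/* Extract the PGN from the last three data bytes (LSB first). */
static uint32_t cm_pgn(const uint8_t dat[8])
{
	return (uint32_t)dat[5] |
	       ((uint32_t)dat[6] << 8) |
	       ((uint32_t)dat[7] << 16);
}

/* Decide what to do with a control frame received for session s. */
static enum cm_action classify_cm(const struct session *s,
				  const uint8_t dat[8])
{
	bool is_abort = dat[0] == 0xff;	/* control byte 255 */

	if (cm_pgn(dat) == s->pgn)
		return CM_PROCESS;
	if (is_abort)
		return CM_IGNORE;
	return CM_SEND_ABORT_KEEP_SESSION;
}
```

With the 0x80->0x90,0x12300 session from the example, a DPO carrying PGN 0x13300
yields CM_SEND_ABORT_KEEP_SESSION, and an abort carrying 0x13300 yields
CM_IGNORE.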

> Correct?
> 
>   What about unrelated control signals? For 0x90 - CTS, EOMA; and
>   for 0x80 - DPO?
>   I ask because this stack has a loopback design, so 0x90 and 0x80 will get their
>   own signals as well.
> 
> I can imagine at least some reason why we can get wrong signals:
> - address conflicts (multiple ECUs configured with same address)
> - buggy software 
> - some CAN bus issues
> - malicious attempts to exploit ECU remotely. 

Nice summary. Although due to the design of CAN and J1939, if a malicious ECU
has physical access to the CAN bus, it is game-over for the whole system
anyway. No point in even attempting to thwart an attack.
But as for resisting bugs and assuring best-effort in maintaining a working
system despite buggy/bogus messages/nodes, I agree.

Best regards,

-- 
David Jander
Protonic Holland.