On Wed, Aug 22, 2018 at 02:23:49PM +0000, Ermans, Brian (Ext.) wrote:
> Hello Julia,
>
> > > > > Changing thread priority or application priority does not change
> > > > > behavior.
> > > >
> > > > Is this true even if ksoftirqd/1 is prioritized over the CAN thread?
> > >
> > > It took us some time, but good call, this actually seems to fix the issue.
> > >
> > > When we force ksoftirqd to FIFO and the CAN thread to NORMAL it does
> > > not lock! (or at least, not within a reasonable time frame, will do
> > > some long-term testing tonight)
> > >
> > > We will have to do extensive testing to verify this is a valid
> > > work-around for us. We don't know if changing these priorities have
> > > unforeseen side-effects at this point, but for now it seems to be
> > > good.
> >
> > Making the CAN thread non-RT means that you've lost all latency
> > guarantees. If that's okay for your application, then why was it RT in
> > the first place? :)
>
> I can explain to the user that we have less latency guarantees
> regarding the CAN output. I cannot explain to them that the system
> locks. If it was a situation where we couldn't use CAN thread as
> SCHED_FIFO if we wanted a non-locking kernel, that would be a valid
> trade-off.
>
> Turns out however we can have the best of both worlds. With insights
> from your detailed information we forced CAN thread to SCHED_FIFO prio
> 1 and confirmed it does not lock, as xhci_hcd has 50 (by default). CAN
> used to be SCHED_FIFO prio 60 originally. I have done extensive
> testing during the day and will leave it running for a couple nights
> to verify it keeps running. Your extensive explanation makes us very
> confident that the issue is fixed for us now.
>
> We wonder if we could have prevented this, as our current set-up was
> done incorrectly, or that we actually found such an obscure edge-case
> that we found a kernel bug...

Well, I alluded to this at the tail end of my last email.
Failure to properly start transmit of a packet should result in the
transmit queue being stopped. This should prevent the NET_TX softirq
from being re-raised until the URB completion callback is run. I'm not
an expert on networking, though, so you might have to chat with the
SocketCAN and Peak folks.

This continuous re-raising "works" on mainline because a re-raise from
softirq context unconditionally defers execution to ksoftirqd. On RT,
the outermost local_bh_enable() will keep executing raised softirqs
until there are no longer any pending.

   Julia
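
The mainline vs. RT difference described above can be sketched as a toy
userspace model. This is not kernel code: struct sim, net_tx_action(),
bh_enable_mainline(), bh_enable_rt() and caller_context_runs() are
hypothetical names made up for illustration, and the budget parameter
exists only so the simulation can terminate. The model assumes the
thread that would complete the URB never gets the CPU while the caller
is still processing softirqs, which matches the case where the caller
runs at a higher SCHED_FIFO priority than that thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state: one softirq bit plus the condition that would end the re-raise. */
struct sim {
    bool net_tx_pending;  /* NET_TX has been raised */
    bool urb_completed;   /* assumed to change only when another thread runs */
    int  handler_runs;    /* handler invocations in the caller's context */
};

/* Hypothetical handler: the transmit cannot start, so it re-raises itself. */
static void net_tx_action(struct sim *s)
{
    s->net_tx_pending = false;
    s->handler_runs++;
    if (!s->urb_completed)
        s->net_tx_pending = true;  /* re-raise from softirq context */
}

/* Mainline-like: handle what was pending once; a re-raise is left for
 * ksoftirqd, so the caller returns and other threads can be scheduled. */
static void bh_enable_mainline(struct sim *s)
{
    if (s->net_tx_pending)
        net_tx_action(s);
}

/* RT-like (as described in this thread): keep running raised softirqs until
 * none are pending. Returns false if the simulation budget runs out first. */
static bool bh_enable_rt(struct sim *s, int budget)
{
    assert(budget > 0);
    while (s->net_tx_pending) {
        if (s->handler_runs >= budget)
            return false;  /* still pending: treated as a livelock */
        net_tx_action(s);
    }
    return true;
}

/* Returns the number of handler runs before the caller gives up the CPU,
 * or -1 if it never does within the budget. */
int caller_context_runs(bool rt_semantics, int budget)
{
    struct sim s = { .net_tx_pending = true, .urb_completed = false,
                     .handler_runs = 0 };

    if (rt_semantics) {
        if (!bh_enable_rt(&s, budget))
            return -1;
    } else {
        bh_enable_mainline(&s);
    }
    return s.handler_runs;
}
```

With mainline-like semantics the caller runs the handler once and
returns; with RT-like semantics it spins for the whole budget because
nothing else gets a chance to clear the condition. Stopping the
transmit queue on a failed transmit would correspond to the handler not
re-raising at all.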