Re: Kernel locks after consecutive writes returning ENOBUFS over can0

Julia Cartwright <julia@xxxxxx> · Wed, 22 Aug 2018 10:15:54 -0500

On Wed, Aug 22, 2018 at 02:23:49PM +0000, Ermans, Brian (Ext.) wrote:
> Hello Julia,
> 
> > > > > Changing thread priority or application priority does not change
> > > > > behavior.
> > > >
> > > > Is this true even if ksoftirqd/1 is prioritized over the CAN thread?
> > >
> > > It took us some time, but good call, this actually seems to fix the issue.
> > >
> > > When we force ksoftirqd to FIFO and the CAN thread to NORMAL it does
> > > not lock! (or at least, not within a reasonable time frame, will do
> > > some long-term testing tonight)
> > >
> > > We will have to do extensive testing to verify this is a valid
> > > work-around for us. We don't know if changing these priorities has
> > > unforeseen side-effects at this point, but for now it seems to be
> > > good.
> >
> > Making the CAN thread non-RT means that you've lost all latency
> > guarantees.  If that's okay for your application, then why was it RT in
> > the first place? :)
> 
> I can explain to the user that we have fewer latency guarantees
> regarding the CAN output. I cannot explain to them that the system
> locks. If it was a situation where we couldn't use CAN thread as
> SCHED_FIFO if we wanted a non-locking kernel, that would be a valid
> trade-off.
> 
> Turns out however we can have the best of both worlds. With insights
> from your detailed information we forced CAN thread to SCHED_FIFO prio
> 1 and confirmed it does not lock, as xhci_hcd has 50 (by default). CAN
> used to be SCHED_FIFO prio 60 originally.  I have done extensive
> testing during the day and will leave it running for a couple nights
> to verify it keeps running. Your extensive explanation makes us very
> confident that the issue is fixed for us now.
> 
> We wonder if we could have prevented this, as our current set-up was
> done incorrectly, or that we actually found such an obscure edge-case
> that we found a kernel bug...
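
For anyone following along, the priority arrangement described in the
quoted message (CAN thread at SCHED_FIFO 1, below the xhci_hcd IRQ
thread at 50) can be sketched from userspace roughly as follows.  This
is a minimal sketch, not the application's actual code: the function
names and the two constants are hypothetical, and it assumes Linux,
where passing pid 0 to sched_setscheduler() affects the calling thread.

```c
#include <assert.h>
#include <sched.h>
#include <string.h>

/* Values taken from the thread; the macro names are hypothetical. */
#define XHCI_IRQ_PRIO   50  /* xhci_hcd IRQ thread priority, per the quoted message */
#define CAN_THREAD_PRIO 1   /* priority that avoided the lockup in testing */

/* Returns nonzero if the CAN thread would rank below the IRQ thread. */
static int can_prio_is_below_irq(int can_prio, int irq_prio)
{
    return can_prio < irq_prio;
}

/*
 * Put the calling thread into SCHED_FIFO at the given priority.
 * Returns 0 on success, -1 with errno set on failure (typically
 * EPERM without CAP_SYS_NICE or a suitable RLIMIT_RTPRIO).
 */
static int set_fifo_priority(int prio)
{
    struct sched_param sp;

    /* Passing an out-of-range priority is treated as a programming error. */
    assert(prio >= sched_get_priority_min(SCHED_FIFO) &&
           prio <= sched_get_priority_max(SCHED_FIFO));

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

The CAN thread would call set_fifo_priority(CAN_THREAD_PRIO) early on
and check the return value; can_prio_is_below_irq() only documents the
ordering that the workaround relies on.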

Well, I alluded to this at the tail end of my last email.  Failure to
properly start transmission of a packet should result in the transmit
queue being stopped.  This should prevent the NET_TX softirq from being
re-raised until the URB completion callback is run.  I'm not an expert
on networking, though, so you might have to chat with the SocketCAN and
Peak folks.

This continuous re-raising "works" on mainline because a re-raise from
softirq context unconditionally defers execution to ksoftirqd. On RT,
the outermost local_bh_enable() will keep executing raised softirqs
until none remain pending.
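
To make the difference concrete, here is a deliberately simplified
userspace toy model of the two behaviours as described above.  It is
not kernel code, and the real logic in kernel/softirq.c is more
involved; every name in it (bh_model, sim_result, simulate) is
hypothetical, and the cap exists only so the "forever" case terminates
in the simulation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: which softirq behaviour to imitate. */
enum bh_model { MODEL_MAINLINE, MODEL_RT };

struct sim_result {
    int  inline_runs; /* handler runs in the raising task's context */
    bool deferred;    /* work was handed to the modelled ksoftirqd */
};

/*
 * Model a handler that re-raises itself `reraises` times
 * (a negative value means "always re-raises").  `cap` bounds the
 * number of inline runs so the simulation always terminates.
 */
static struct sim_result simulate(enum bh_model model, int reraises, int cap)
{
    struct sim_result r = { 0, false };
    bool pending = true;
    int left = reraises;

    assert(cap > 0);
    while (pending && r.inline_runs < cap) {
        pending = false;
        r.inline_runs++;            /* run the handler once */
        if (left != 0) {            /* the handler re-raises itself */
            if (left > 0)
                left--;
            if (model == MODEL_MAINLINE) {
                r.deferred = true;  /* re-raise goes to ksoftirqd */
                break;
            }
            pending = true;         /* RT: same context keeps looping */
        }
    }
    return r;
}
```

With a handler that always re-raises, simulate(MODEL_MAINLINE, -1, 1000)
runs it once inline and marks the rest as deferred, whereas
simulate(MODEL_RT, -1, 1000) only stops because it hits the cap.  In
this model, that is the situation a high-priority SCHED_FIFO caller
ends up in.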

   Julia