Re: Kernel locks after consecutive writes returning ENOBUFS over can0

Julia Cartwright <julia@xxxxxx> · Tue, 21 Aug 2018 16:31:36 -0500

On Mon, Aug 20, 2018 at 01:46:59PM +0000, Ermans, Brian (Ext.) wrote:
> Hello Julia,
>
> Thank you for your reply.
>
> > > Changing thread priority or application priority does not change
> > > behavior.
> >
> > Is this true even if ksoftirqd/1 is prioritized over the CAN thread?
>
> It took us some time, but good call, this actually seems to fix the issue.
>
> When we force ksoftirqd to FIFO and the CAN thread to NORMAL it does
> not lock! (or at least, not within a reasonable time frame, will do
> some long-term testing tonight)
>
> We will have to do extensive testing to verify this is a valid
> work-around for us. We don't know if changing these priorities have
> unforeseen side-effects at this point, but for now it seems to be
> good.

Making the CAN thread non-RT means that you've lost all latency
guarantees.  If that's okay for your application, then why was it RT in
the first place? :)

Looking at the traces you sent me, I'm not actually sure it's the
relative prioritization of ksoftirqd and your CAN thread that is causing
issues.  I don't think the state of the ksoftirqd thread is relevant at
all.  The two relevant threads are the CAN thread and the xhci_hcd
irqthread.

When in the livelock scenario, all of your traces look like this,
starting with the failed write():

     CAN-2557    1...10   193.299273: sys_enter_write:      fd: 0x000c, buf: 0x8ccc3c50, count: 0x0010
     CAN-2557    1....0   193.299276: kmem_cache_alloc:     (__alloc_skb+0x44) [FAILED TO PARSE] call_site=0xc1566b54 ptr=0xf5ad7e40 bytes_req=192 bytes_alloc=192 gfp_flags=37749952
     CAN-2557    1....0   193.299277: kmalloc:              (__alloc_skb+0x67) [FAILED TO PARSE] call_site=0xc1566b77 ptr=0xc47f2f00 bytes_req=256 bytes_alloc=256 gfp_flags=37816000
     CAN-2557    1....1   193.299279: net_dev_queue:        dev=can0 skbaddr=0xf5ad7e40 len=16
     CAN-2557    1....4   193.299281: net_dev_start_xmit:   dev=can0 queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0 network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0
     CAN-2557    1....4   193.299282: net_dev_xmit:         dev=can0 skbaddr=0xf5ad7e40 len=16 rc=16
     CAN-2557    1d...4   193.299283: softirq_raise:        vec=2 [action=NET_TX]
     CAN-2557    1....2   193.299284: softirq_entry:        vec=2 [action=NET_TX]
     CAN-2557    1....4   193.299286: net_dev_start_xmit:   dev=can0 queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0 network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0
     CAN-2557    1....4   193.299286: net_dev_xmit:         dev=can0 skbaddr=0xf5ad7e40 len=16 rc=16
     CAN-2557    1d...4   193.299287: softirq_raise:        vec=2 [action=NET_TX]
     CAN-2557    1....2   193.299288: softirq_exit:         vec=2 [action=NET_TX]
     CAN-2557    1....2   193.299289: softirq_entry:        vec=2 [action=NET_TX]
     CAN-2557    1....4   193.299290: net_dev_start_xmit:   dev=can0 queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0 network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0
     CAN-2557    1....4   193.299291: net_dev_xmit:         dev=can0 skbaddr=0xf5ad7e40 len=16 rc=16
     CAN-2557    1d...4   193.299291: softirq_raise:        vec=2 [action=NET_TX]
     CAN-2557    1....2   193.299292: softirq_exit:         vec=2 [action=NET_TX]
     CAN-2557    1....2   193.299293: softirq_entry:        vec=2 [action=NET_TX]
     /* ... and so on, ad infinitum ... */

The thread queues a packet for transmission, raises the NET_TX softirq,
which is executed in the context of this threads' outermost
local_bh_enable().  This looping/reraising behavior appears to be
qdisc_run(), which matches the backtraces you provided in your first
email.

qdisc_run() is "bounded" by either meeting a packet quota (which it
wouldn't, because we aren't successfully sending packets), and
need_resched(), which will be false if the CAN thread is the highest
priority.  So, around we go.

The only thing which would make the TX succeed was if there was an
available URB, but that won't happen until the xhci_hcd irqthread can
free them, but this thread isn't being executed...

What's preventing the xhci_hcd irqthread from executing?  Well, for one,
it's prioritized under the CAN thread, so it's starved for CPU time on
the same CPU.  This is a failure of properly prioritizing dependent
threads (see my proposed solution below).

A follow up question might be: well, I have two CPUs, why isn't the
xhci_hcd irqthread being scheduled on the other CPU??  And the answer
is: the irqthread was preempted when it was amidst a migrate-disable
region.

This also explains why this situation appears to work a lot of the time
(when the irqthread can be moved), but fails sometimes (when it's been
preempted in a migrate-disable region).

A solution: prioritize the xhci_hcd irqthread above your CAN thread (or
at least == w/ SCHED_RR).  Note that this makes you susceptible to
latencies induced by irrelevant traffic on the USB bus, but if you're
controlling the environment otherwise this might be fine.

The only thing that's unclear to me about this situation is why the
devices' queue hasn't been stopped when there wasn't any URBs
left...

> We have confirmed this issue does not occur when using a mPCIe device
> from Peak. User space code is the same as we still write to can0.  I
> failed to mention this before as we didn't test this yet at the time I
> wrote the original mail.

Well, PCI is a completely different beast entirely, so this is no
surprise to me.

   Julia