RE: Kernel locks after consecutive writes returning ENOBUFS over can0

Hello Julia,

> > > > Changing thread priority or application priority does not change
> > > > behavior.
> > >
> > > Is this true even if ksoftirqd/1 is prioritized over the CAN thread?
> >
> > It took us some time, but good call, this actually seems to fix the issue.
> >
> > When we force ksoftirqd to FIFO and the CAN thread to NORMAL it does
> > not lock! (or at least, not within a reasonable time frame, will do
> > some long-term testing tonight)
> >
> > We will have to do extensive testing to verify this is a valid
> > work-around for us. We don't know if changing these priorities have
> > unforeseen side-effects at this point, but for now it seems to be
> > good.
>
> Making the CAN thread non-RT means that you've lost all latency
> guarantees.  If that's okay for your application, then why was it RT in
> the first place? :)

I can explain to the user that we have weaker latency guarantees on the CAN output. I cannot explain to them that the system locks up. If the trade-off had been that we cannot run the CAN thread as SCHED_FIFO if we want a non-locking kernel, that would have been an acceptable trade-off.

It turns out, however, that we can have the best of both worlds. Using the insights from your detailed explanation, we forced the CAN thread to SCHED_FIFO priority 1 and confirmed it does not lock, since xhci_hcd runs at priority 50 by default. The CAN thread originally ran at SCHED_FIFO priority 60.
I have done extensive testing during the day and will leave it running for a couple of nights to verify it keeps running. Your thorough explanation makes us very confident that the issue is now fixed for us.

We wonder whether we could have prevented this, i.e. whether our original set-up was simply configured incorrectly, or whether we hit such an obscure edge case that it actually amounts to a kernel bug...

Thanks again for your investigation and help.

Brian

>
> Looking at the traces you sent me, I'm not actually sure it's the
> relative prioritization of ksoftirqd and your CAN thread that is causing
> issues.  I don't think the state of the ksoftirqd thread is relevant at
> all.  The two relevant threads are the CAN thread and the xhci_hcd
> irqthread.
>
> When in the livelock scenario, all of your traces look like this,
> starting with the failed write():
>
>      CAN-2557    1...10   193.299273: sys_enter_write:      fd: 0x000c, buf:
> 0x8ccc3c50, count: 0x0010
>      CAN-2557    1....0   193.299276: kmem_cache_alloc:
> (__alloc_skb+0x44) [FAILED TO PARSE] call_site=0xc1566b54
> ptr=0xf5ad7e40 bytes_req=192 bytes_alloc=192 gfp_flags=37749952
>      CAN-2557    1....0   193.299277: kmalloc:              (__alloc_skb+0x67)
> [FAILED TO PARSE] call_site=0xc1566b77 ptr=0xc47f2f00 bytes_req=256
> bytes_alloc=256 gfp_flags=37816000
>      CAN-2557    1....1   193.299279: net_dev_queue:        dev=can0
> skbaddr=0xf5ad7e40 len=16
>      CAN-2557    1....4   193.299281: net_dev_start_xmit:   dev=can0
> queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000
> vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0
> network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0
> gso_size=0 gso_segs=0 gso_type=0
>      CAN-2557    1....4   193.299282: net_dev_xmit:         dev=can0
> skbaddr=0xf5ad7e40 len=16 rc=16
>      CAN-2557    1d...4   193.299283: softirq_raise:        vec=2 [action=NET_TX]
>      CAN-2557    1....2   193.299284: softirq_entry:        vec=2 [action=NET_TX]
>      CAN-2557    1....4   193.299286: net_dev_start_xmit:   dev=can0
> queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000
> vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0
> network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0
> gso_size=0 gso_segs=0 gso_type=0
>      CAN-2557    1....4   193.299286: net_dev_xmit:         dev=can0
> skbaddr=0xf5ad7e40 len=16 rc=16
>      CAN-2557    1d...4   193.299287: softirq_raise:        vec=2 [action=NET_TX]
>      CAN-2557    1....2   193.299288: softirq_exit:         vec=2 [action=NET_TX]
>      CAN-2557    1....2   193.299289: softirq_entry:        vec=2 [action=NET_TX]
>      CAN-2557    1....4   193.299290: net_dev_start_xmit:   dev=can0
> queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000
> vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0
> network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0
> gso_size=0 gso_segs=0 gso_type=0
>      CAN-2557    1....4   193.299291: net_dev_xmit:         dev=can0
> skbaddr=0xf5ad7e40 len=16 rc=16
>      CAN-2557    1d...4   193.299291: softirq_raise:        vec=2 [action=NET_TX]
>      CAN-2557    1....2   193.299292: softirq_exit:         vec=2 [action=NET_TX]
>      CAN-2557    1....2   193.299293: softirq_entry:        vec=2 [action=NET_TX]
>      /* ... and so on, ad infinitum ... */
>
> The thread queues a packet for transmission and raises the NET_TX softirq,
> which is executed in the context of this thread's outermost
> local_bh_enable().  This looping/re-raising behavior appears to be
> qdisc_run(), which matches the backtraces you provided in your first
> email.
>
> qdisc_run() is "bounded" by either meeting a packet quota (which it
> won't, because we aren't successfully sending packets) or by
> need_resched(), which will be false if the CAN thread is the highest
> priority.  So, around we go.
>
> The only thing that would make the TX succeed is an available URB, but
> none will become available until the xhci_hcd irqthread can free them,
> and that thread isn't being executed...
>
> What's preventing the xhci_hcd irqthread from executing?  Well, for one,
> it's prioritized below the CAN thread, so it's starved for CPU time on
> the same CPU.  This is a failure to properly prioritize dependent
> threads (see my proposed solution below).
>
> A follow-up question might be: well, I have two CPUs, why isn't the
> xhci_hcd irqthread being scheduled on the other CPU?  The answer is
> that the irqthread was preempted while it was inside a migrate-disable
> region.
>
> This also explains why this situation appears to work a lot of the time
> (when the irqthread can be moved), but fails sometimes (when it's been
> preempted in a migrate-disable region).
>
> A solution: prioritize the xhci_hcd irqthread above your CAN thread (or
> at least == w/ SCHED_RR).  Note that this makes you susceptible to
> latencies induced by irrelevant traffic on the USB bus, but if you're
> controlling the environment otherwise this might be fine.
>
> The only thing that's still unclear to me about this situation is why
> the device's queue wasn't stopped when there weren't any URBs
> left...
>
> > We have confirmed this issue does not occur when using an mPCIe device
> > from Peak.  The user-space code is the same, as we still write to can0.
> > I failed to mention this before because we hadn't tested it yet at the
> > time I wrote the original mail.
>
> Well, PCI is a different beast entirely, so this is no surprise to me.
>
>    Julia

HBM Netherlands B.V., Schutweg 15a, NL-5145 NP Waalwijk, The Netherlands | www.hbm.com
