On Mon, Aug 20, 2018 at 01:46:59PM +0000, Ermans, Brian (Ext.) wrote: > Hello Julia, > > Thank you for your reply. > > > > Changing thread priority or application priority does not change > > > behavior. > > > > Is this true even if ksoftirqd/1 is prioritized over the CAN thread? > > It took us some time, but good call, this actually seems to fix the issue. > > When we force ksoftirqd to FIFO and the CAN thread to NORMAL it does > not lock! (or at least, not within a reasonable time frame, will do > some long-term testing tonight) > > We will have to do extensive testing to verify this is a valid > work-around for us. We don't know if changing these priorities have > unforeseen side-effects at this point, but for now it seems to be > good. Making the CAN thread non-RT means that you've lost all latency guarantees. If that's okay for your application, then why was it RT in the first place? :) Looking at the traces you sent me, I'm not actually sure it's the relative prioritization of ksoftirqd and your CAN thread that is causing issues. I don't think the state of the ksoftirqd thread is relevant at all. The two relevant threads are the CAN thread and the xhci_hcd irqthread. When in the livelock scenario, all of your traces look like this, starting with the failed write(): CAN-2557 1...10 193.299273: sys_enter_write: fd: 0x000c, buf: 0x8ccc3c50, count: 0x0010 CAN-2557 1....0 193.299276: kmem_cache_alloc: (__alloc_skb+0x44) [FAILED TO PARSE] call_site=0xc1566b54 ptr=0xf5ad7e40 bytes_req=192 bytes_alloc=192 gfp_flags=37749952 CAN-2557 1....0 193.299277: kmalloc: (__alloc_skb+0x67) [FAILED TO PARSE] call_site=0xc1566b77 ptr=0xc47f2f00 bytes_req=256 bytes_alloc=256 gfp_flags=37816000 CAN-2557 1....1 193.299279: net_dev_queue: dev=can0 skbaddr=0xf5ad7e40 len=16 CAN-2557 1....4 193.299281: net_dev_start_xmit: dev=can0 queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0 network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0 CAN-2557 1....4 193.299282: net_dev_xmit: dev=can0 skbaddr=0xf5ad7e40 len=16 rc=16 CAN-2557 1d...4 193.299283: softirq_raise: vec=2 [action=NET_TX] CAN-2557 1....2 193.299284: softirq_entry: vec=2 [action=NET_TX] CAN-2557 1....4 193.299286: net_dev_start_xmit: dev=can0 queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0 network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0 CAN-2557 1....4 193.299286: net_dev_xmit: dev=can0 skbaddr=0xf5ad7e40 len=16 rc=16 CAN-2557 1d...4 193.299287: softirq_raise: vec=2 [action=NET_TX] CAN-2557 1....2 193.299288: softirq_exit: vec=2 [action=NET_TX] CAN-2557 1....2 193.299289: softirq_entry: vec=2 [action=NET_TX] CAN-2557 1....4 193.299290: net_dev_start_xmit: dev=can0 queue_mapping=0 skbaddr=0xf5ad7e40 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x000c ip_summed=1 len=16 data_len=0 network_offset=0 transport_offset_valid=1 transport_offset=0 tx_flags=0 gso_size=0 gso_segs=0 gso_type=0 CAN-2557 1....4 193.299291: net_dev_xmit: dev=can0 skbaddr=0xf5ad7e40 len=16 rc=16 CAN-2557 1d...4 193.299291: softirq_raise: vec=2 [action=NET_TX] CAN-2557 1....2 193.299292: softirq_exit: vec=2 [action=NET_TX] CAN-2557 1....2 193.299293: softirq_entry: vec=2 [action=NET_TX] /* ... and so on, ad infinitum ... */ The thread queues a packet for transmission, raises the NET_TX softirq, which is executed in the context of this threads' outermost local_bh_enable(). This looping/reraising behavior appears to be qdisc_run(), which matches the backtraces you provided in your first email. qdisc_run() is "bounded" by either meeting a packet quota (which it wouldn't, because we aren't successfully sending packets), and need_resched(), which will be false if the CAN thread is the highest priority. So, around we go. The only thing which would make the TX succeed was if there was an available URB, but that won't happen until the xhci_hcd irqthread can free them, but this thread isn't being executed... What's preventing the xhci_hcd irqthread from executing? Well, for one, it's prioritized under the CAN thread, so it's starved for CPU time on the same CPU. This is a failure of properly prioritizing dependent threads (see my proposed solution below). A follow up question might be: well, I have two CPUs, why isn't the xhci_hcd irqthread being scheduled on the other CPU?? And the answer is: the irqthread was preempted when it was amidst a migrate-disable region. This also explains why this situation appears to work a lot of the time (when the irqthread can be moved), but fails sometimes (when it's been preempted in a migrate-disable region). A solution: prioritize the xhci_hcd irqthread above your CAN thread (or at least == w/ SCHED_RR). Note that this makes you susceptible to latencies induced by irrelevant traffic on the USB bus, but if you're controlling the environment otherwise this might be fine. The only thing that's unclear to me about this situation is why the devices' queue hasn't been stopped when there wasn't any URBs left... > We have confirmed this issue does not occur when using a mPCIe device > from Peak. User space code is the same as we still write to can0. I > failed to mention this before as we didn't test this yet at the time I > wrote the original mail. Well, PCI is a completely different beast entirely, so this is no surprise to me. Julia