firewire-ohci: device rediscovery hardlock regression

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

I'd like to report a regression in firewire-ohci that results
in the kernel hardlocking when re-discovering a FireWire device.

TI XIO2213B
RME FireFace 800

It will occur under three conditions:
 * power-cycling the FireWire device
 * un- and re-plugging the FireWire device
 * suspending and then waking the PC

Often it would also occur directly on boot in QEMU but I have not
yet observed this specific behavior on bare metal.

Here is an excerpt from the stack trace (don't know whether it is
acceptable to send in full):

kernel: ------------[ cut here ]------------
kernel: refcount_t: addition on 0; use-after-free.
kernel: WARNING: CPU: 3 PID: 116 at lib/refcount.c:25
refcount_warn_saturate (/build/linux/lib/refcount.c:25 (discriminator
1)) 
kernel: Workqueue: firewire_ohci bus_reset_work
kernel: RIP: 0010:refcount_warn_saturate
(/build/linux/lib/refcount.c:25 (discriminator 1)) 
kernel: Call Trace:
kernel:  <TASK>
kernel: ? refcount_warn_saturate (/build/linux/lib/refcount.c:25
(discriminator 1)) 
kernel: ? __warn.cold (/build/linux/kernel/panic.c:693) 
kernel: ? refcount_warn_saturate (/build/linux/lib/refcount.c:25
(discriminator 1)) 
kernel: ? report_bug (/build/linux/lib/bug.c:180
/build/linux/lib/bug.c:219) 
kernel: ? handle_bug (/build/linux/arch/x86/kernel/traps.c:218) 
kernel: ? exc_invalid_op (/build/linux/arch/x86/kernel/traps.c:260
(discriminator 1)) 
kernel: ? asm_exc_invalid_op
(/build/linux/./arch/x86/include/asm/idtentry.h:621) 
kernel: ? refcount_warn_saturate (/build/linux/lib/refcount.c:25
(discriminator 1)) 
kernel: for_each_fw_node (/build/linux/./include/linux/refcount.h:190
/build/linux/./include/linux/refcount.h:241
/build/linux/./include/linux/refcount.h:258
/build/linux/drivers/firewire/core.h:199
/build/linux/drivers/firewire/core-topology.c:275) 
kernel: ? __pfx_report_found_node (/build/linux/drivers/firewire/core-
topology.c:312) 
kernel: fw_core_handle_bus_reset (/build/linux/drivers/firewire/core-
topology.c:399 (discriminator 1) /build/linux/drivers/firewire/core-
topology.c:504 (discriminator 1)) 
kernel: bus_reset_work (/build/linux/drivers/firewire/ohci.c:2121) 
kernel: process_one_work
(/build/linux/./arch/x86/include/asm/jump_label.h:27
/build/linux/./include/linux/jump_label.h:207
/build/linux/./include/trace/events/workqueue.h:110
/build/linux/kernel/workqueue.c:3236) 
kernel: worker_thread (/build/linux/kernel/workqueue.c:3306
(discriminator 2) /build/linux/kernel/workqueue.c:3393 (discriminator
2)) 
kernel: ? __pfx_worker_thread (/build/linux/kernel/workqueue.c:3339) 
kernel: kthread (/build/linux/kernel/kthread.c:389) 
kernel: ? __pfx_kthread (/build/linux/kernel/kthread.c:342) 
kernel: ret_from_fork (/build/linux/arch/x86/kernel/process.c:153) 
kernel: ? __pfx_kthread (/build/linux/kernel/kthread.c:342) 
kernel: ret_from_fork_asm (/build/linux/arch/x86/entry/entry_64.S:254) 
kernel:  </TASK>

I have identified the commit via bisection:
24b7f8e5cd656196a13077e160aec45ad89b58d9
firewire: core: use helper functions for self ID sequence

It was part of the following patch series:
firewire: add tracepoints events for self ID sequence
https://lore.kernel.org/all/20240605235155.116468-6-o-takashi@xxxxxxxxxxxxx/

#regzbot introduced: 24b7f8e5cd65

Since this was before v6.10-rc5 and stable 6.10.14 is EOL,
stable v6.11.5 and mainline are affected.

Reversion appears to be non-trivial as it is part of a patch
series, other files have been altered as well and other commits
build on top of it.

Call chain:
core-topology.c fw_core_handle_bus_reset()
-> core-topology.c   for_each_fw_node(card, local_node,
report_found_node)
-> core.h            fw_node_get(root)
-> refcount.h        __refcount_inc(&node)
-> refcount.h        __refcount_add(1, r, oldp);
-> refcount.h        refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
-> refcount.h        REFCOUNT_WARN("addition on 0; use-after-free")

Since local_node of fw_core_handle_bus_reset() is retrieved by
	local_node = build_tree(card, self_ids, self_id_count);
build_tree() needs to be looked at, it was indeed altered by
24b7f8e5cd65.

After a hard 3 hour look traversing all used functions and comparing
against the original function (as of e404cacfc5ed), this caught my eye:
       for (port_index = 0; port_index < total_port_count;
++port_index) {
               switch (port_status) {
               case PHY_PACKET_SELF_ID_PORT_STATUS_PARENT:
                       node->color = i;

In both for loops, "port_index" was replaced by "i"
"i" remains in use above:
       for (i = 0, h = &stack; i < child_port_count; i++)
               h = h->prev;

While the original also used the less descriptive i in the loop
       for (i = 0; i < port_count; i++) {
               switch (get_port_type(sid, i)) {
               case SELFID_PORT_PARENT:
                        node->color = i;
but reset it to 0 at the beginning of the loop.

So the stray "i" in the for loop should be replaced with the loop
iterator "port_index" as it is meant to be synchronous with the
loop iterator (i.e. the port_index), no?

diff --git a/drivers/firewire/core-topology.c b/drivers/firewire/core-
topology.c
index 8c10f47cc8fc..7fd91ba9c9c4 100644
--- a/drivers/firewire/core-topology.c
+++ b/drivers/firewire/core-topology.c
@@ -207,7 +207,7 @@ static struct fw_node *build_tree(struct fw_card
*card, const u32 *sid, int self
                                // the node->ports array where the
parent node should be.  Later,
                                // when we handle the parent node, we
fix up the reference.
                                ++parent_count;
-                               node->color = i;
+                               node->color = port_index;
                                break;

What threw me off was discaridng node->color as it would be replaced
later anyways (can't be important!), or so I thought.

Please tell me, is this line of reasoning correct or am I missing
something?

Compiling 24b7f8e5cd65 and later mainline with the patch above
resulted in a kernel that didn't crash!

In case my solution should turn out to be correct, I will gladly
submit the patch.

Kind regards,
Edmund Raile.







[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux