Re: Bug: lock problem for the function of_find_node_by_name

Rob Herring <robh@xxxxxxxxxx> · Wed, 12 Mar 2025 09:19:57 -0500

On Tue, Mar 11, 2025 at 8:41 PM Ryder Wang <rydercoding@xxxxxxxxxxx> wrote:
>
> Hi Rob,
>
> Thanks for your reply.

Please don't top post.

> This issue occurred on some embedded ARM system for some device driver which called of_find_node_by_name. Below is the kernel log including the call stack:
>
>     [  650.456107][ T3481] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1637
>     [  650.465589][ T3481] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 3481, name: kworker/0:0
>     [  650.474970][ T3481] Preemption disabled at:
>     [  650.474976][ T3481] [<ffffffd36bb03118>] of_find_node_by_name+0x2c/0x124
>     [  650.486191][ T3481] CPU: 0 PID: 3481 Comm: kworker/0:0 Tainted: G           OE     5.15.149-debug-gc1dc9fe4253b-dirty #1
>     [  650.486208][ T3481] Hardware name: xxxxxxxxxxxxxxxxxxxxxxxxxx
>     [  650.486219][ T3481] Workqueue: events_power_efficient phylink_resolve
>     [  650.486244][ T3481] Call trace:
>     [  650.486249][ T3481]  dump_backtrace+0x0/0x214
>     [  650.486271][ T3481]  show_stack+0x18/0x24
>     [  650.486287][ T3481]  dump_stack_lvl+0x64/0x7c
>     [  650.486305][ T3481]  dump_stack+0x18/0x38
>     [  650.486319][ T3481]  ___might_sleep+0x15c/0x180
>     [  650.486336][ T3481]  __might_sleep+0x50/0x84
>     [  650.486348][ T3481]  down_write+0x28/0x54
>     [  650.486364][ T3481]  kernfs_remove+0x38/0x58
>     [  650.486381][ T3481]  sysfs_remove_dir+0x54/0x70
>     [  650.486396][ T3481]  __kobject_del+0x50/0xe8
>     [  650.486413][ T3481]  kobject_cleanup+0x58/0x1e4
>     [  650.486427][ T3481]  kobject_put+0x64/0xb0
>     [  650.486439][ T3481]  of_node_put+0x1c/0x28
>     [  650.486454][ T3481]  of_find_node_by_name+0x74/0x124
>     [  650.486466][ T3481]  ethqos_configure_mac_v4+0x13b0/0x1750

Not a function in mainline...

The assumption with of_find_node_by_name() and all the DT functions that
operate as iterators is that the caller does a get on the first node
before the first call. Each call then does a get on the next node and a
put on the previous one. We could move the put out of the spinlock, but
then you might not find the bug in the caller. Also, all the iterator
functions do the same thing.
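
To make the get/put contract concrete, here is a minimal userspace
sketch. It is not the kernel code: struct node, node_get(), node_put(),
find_node_by_name(), root and the "released" flag are all hypothetical
stand-ins, and the sketch assumes each node starts with one reference
owned by the tree itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct device_node; not the kernel type. */
struct node {
    const char *name;
    int refcount;       /* assumed to start at 1 (the tree's reference) */
    int released;       /* set once the last reference is dropped */
    struct node *next;  /* next node in a flat "all nodes" list */
};

/* Hypothetical head of the node list, standing in for the tree root. */
static struct node *root;

static struct node *node_get(struct node *n)
{
    if (n)
        n->refcount++;
    return n;
}

static void node_put(struct node *n)
{
    if (!n)
        return;
    assert(n->refcount > 0);
    if (--n->refcount == 0)
        n->released = 1;  /* models the release path seen in the trace */
}

/*
 * Models the iterator contract: search after 'from' (or from the root
 * when 'from' is NULL), take a reference on the match, and drop the
 * caller's reference on 'from'.
 */
static struct node *find_node_by_name(struct node *from, const char *name)
{
    struct node *np = from ? from->next : root;

    /* The real function holds a spinlock from here ... */
    for (; np; np = np->next) {
        if (strcmp(np->name, name) == 0) {
            node_get(np);
            break;
        }
    }
    node_put(from);
    /* ... to here, which is why the put must never be the last one. */
    return np;
}
```

In this model, starting with find_node_by_name(NULL, ...) and feeding
each result back in as 'from' keeps the counts balanced. Passing a node
the caller never did a get on makes the put drop the tree's own
reference, so the release runs inside the locked region.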

One thing I noticed is that for_each_of_allnodes_from() is not safe to
call outside the spinlock, and we have one user doing that
(drivers/clk/ti/clk.c).

Rob