On Fri, Feb 7, 2025 at 9:37 AM Laurent Pinchart <laurent.pinchart@xxxxxxxxxxxxxxxx> wrote: > > Hi Zekun, > > On Fri, Feb 07, 2025 at 07:28:23PM +0800, zhangzekun (A) wrote: > > 在 2025/2/7 16:24, Oleksij Rempel 写道: > > > On Fri, Feb 07, 2025 at 09:31:09AM +0800, Zhang Zekun wrote: > > >> There are many drivers use of_find_node_by_name() with a not-NULL > > >> device_node pointer, and a number of callers would require a call to > > >> of_node_get() before using it. There are also some drivers who forget > > >> to call of_node_get() which would cause a ref count leak[1]. So, Add a > > >> wraper function for of_find_node_by_name(), drivers may use this function > > >> to call of_find_node_by_name() with the refcount already balanced. > > >> > > >> [1] https://lore.kernel.org/all/20241024015909.58654-1-zhangzekun11@xxxxxxxxxx/ > > > > > > Hi Zhang Zekun, > > > > > > thank you for working on this issue! > > > > > > First of all, let's take a step back and analyze the initial problem. > > > Everything following is only my opinion... > > > > > > The main issue I see is that the current API - of_find_node_by_name - > > > modifies the refcount of its input by calling of_node_put(from) as part > > > of its search. Typically, a "find" function is expected to treat its > > > input as read-only. That is, when you pass an object into such a > > > function, you expect its reference count to remain unchanged unless > > > ownership is explicitly transferred. In this case, lowering the refcount > > > on the input node is counterintuitive and already lead to unexpected > > > behavior and subtle bugs. > > > > > > To address this, the workaround introduces a wrapper function, > > > of_find_node_by_name_balanced, which first increments the input’s > > > refcount (via of_node_get()) before calling the original function. While > > > this "balances" the refcount change, the naming remains problematic from > > > my perspective. The "_balanced" suffix isn’t part of our common naming > > > conventions (traditions? :)). Most drivers expect that a function > > > starting with "find" will not alter the reference count of its input. > > > The term "balanced" doesn’t clearly convey that the input's refcount is > > > being explicitly managed - it instead obscures the underlying behavior, > > > leaving many developers confused about what guarantees the API provides. > > > > > > In my view, a more natural solution would be to redesign the API so that > > > it doesn’t modify the input object’s refcount at all. Instead, it should > > > solely increase the refcount of the returned node (if found) for safe > > > asynchronous usage. This approach would align with established > > > conventions where "find" implies no side effects on inputs or output, > > > and a "get" indicates that the output comes with an extra reference. For > > > example, a function named of_get_node_by_name would clearly signal that > > > only the returned node is subject to a refcount increase while leaving > > > the input intact. > > > > > > Thus, while the current workaround "balances" the reference count, it > > > doesn't address the underlying design flaw. The naming still suggests a > > > "find" function that should leave the input untouched, which isn’t the > > > case here. A redesign of the API - with both the behavior and naming > > > aligned to common expectations - would be a clearer and more robust > > > solution. > > > > > > Nevertheless, it is only my POV, and the final decision rests with the > > > OpenFirmware framework maintainers. > > > > > > Best Regards, > > > Oleksij > > > > Hi, Oleksij, > > > > Thanks for your review. I think redesign the API would be a fundamental > > way to fix this issue, but it would cause impact for current users who > > knows the exact functionality of of_find_node_by_name(), which need to > > put the "from" before calling it (I can't make the assumption that all > > of drivers calling of_find_node_by_name() with a not-NULL "from" > > pointer, but not calling of_node_get() before is misuse). The basic idea > > for adding a new function is try not to impact current users. For users > > who are confused about the "_balanced" suffix,the comments of > > of_find_node_by_name() explains why this interface exists. > > I think we all agree that of_find_node_by_name() is miused, and that it > shows the API isn't optimal. What we have different opinions on is how > to make the API less error-prone. I think adding a new > of_find_node_by_name_balanced() function works around the issue and > doesn't improve the situation much, I would argue it makes things even > more confusing. > > We have only 20 calls to of_find_node_by_name() with a non-NULL first > argument in v6.14-rc1: > > arch/powerpc/platforms/chrp/pci.c: rtas = of_find_node_by_name (root, "rtas"); > > The 'root' variable here is the result of a call to > 'of_find_node_by_path("/")', so I think we could pass a null pointer > instead to simplify things. I think this could just be 'of_find_node_by_path("/rtas")' which does occur elsewhere. > arch/powerpc/platforms/powermac/pic.c: slave = of_find_node_by_name(master, "mac-io"); > > Here I believe of_find_node_by_name() is called to find a *child* node > of 'master'. of_find_node_by_name() is the wrong function for that. Yes, I think that's always a child of the PCI bus. > > arch/sparc/kernel/leon_kernel.c: np = of_find_node_by_name(rootnp, "GAISLER_IRQMP"); > arch/sparc/kernel/leon_kernel.c: np = of_find_node_by_name(rootnp, "01_00d"); > arch/sparc/kernel/leon_kernel.c: np = of_find_node_by_name(nnp, "GAISLER_GPTIMER"); > arch/sparc/kernel/leon_kernel.c: np = of_find_node_by_name(nnp, "01_011"); > > Here too the code seems to be looking for child nodes only (but I > couldn't find a DT example or binding in-tree, so I'm not entirely > sure). Sparc doesn't use OF_DYNAMIC, so who cares... But there are some Sparc DT dumps here: https://git.kernel.org/pub/scm/linux/kernel/git/davem/prtconfs.git/ > > drivers/clk/ti/clk.c: return of_find_node_by_name(from, tmp); > > Usage here seems correct, the reference-count decrement is intended. > > drivers/media/i2c/max9286.c: i2c_mux = of_find_node_by_name(dev->of_node, "i2c-mux"); > drivers/media/platform/qcom/venus/core.c: enp = of_find_node_by_name(dev->of_node, node_name); > drivers/net/dsa/bcm_sf2.c: ports = of_find_node_by_name(dn, "ports"); > drivers/net/dsa/hirschmann/hellcreek_ptp.c: leds = of_find_node_by_name(hellcreek->dev->of_node, "leds"); > drivers/net/ethernet/broadcom/asp2/bcmasp.c: ports_node = of_find_node_by_name(dev->of_node, "ethernet-ports"); > drivers/net/ethernet/marvell/prestera/prestera_main.c: ports = of_find_node_by_name(sw->np, "ports"); > drivers/net/pse-pd/tps23881.c: channels_node = of_find_node_by_name(priv->np, "channels"); > drivers/regulator/scmi-regulator.c: np = of_find_node_by_name(handle->dev->of_node, "regulators"); > drivers/regulator/tps6594-regulator.c: np = of_find_node_by_name(tps->dev->of_node, multi_regs[multi].supply_name); > > Incorrect usage, as far as I understand all those drivers are looking > for child nodes only. > > drivers/of/unittest.c: found = of_find_node_by_name(nd->overlay, "test-unittest16"); > drivers/of/unittest.c: found = of_find_node_by_name(nd->overlay, "test-unittest17"); > drivers/of/unittest.c: found = of_find_node_by_name(nd->overlay, "test-unittest18"); > drivers/of/unittest.c: found = of_find_node_by_name(nd->overlay, "test-unittest19"); > > Here too I think only child nodes are meant to be considered. > > of_find_node_by_name() is very much misused as most callers want to find > child nodes, while of_find_node_by_name() will walk the whole DT from a > given starting point. Agreed. In general, it's preferred to look for nodes by compatible rather than node name with child nodes being an exception. > I think the right fix here is to > > - Replace of_find_node_by_name(root, ...) with > of_find_node_by_name(NULL, ...) in arch/powerpc/platforms/chrp/pci.c > (if my understanding of the code is correct). > > - Replace of_find_node_by_name() with of_get_child_by_name() in callers > that need to search immediate children only (I expected that to be the > majority of the above call sites). +1 > - If there are other callers that need to find indirect children, > introduce a new of_get_child_by_name_recursive() function. > > At that point, the only remaining caller of of_find_node_by_name() > (beside its usage in the for_each_node_by_name() macro) will be > drivers/clk/ti/clk.c, which uses the function correctly. > > I'm tempted to then rename of_find_node_by_name() to > __of_find_node_by_name() to indicate it's an internal function not meant > to be called except in special cases. It could all be renamed to > __of_find_next_node_by_name() to make its behaviour clearer. +1 There's a number of functions which I classify as "don't add new users" which I keep meaning to document. It's generally older stuff still using them. Anything returning pointers to raw DT data are problematic for dynamic DT. Hence all the patches removing of_find_property/of_get_property calls. Rob