Hi Paul et al, I have recently received two independent reports of a performance regression in the GPIO character device. In both cases it was bisected to commit 1f2bcb8c8ccd ("gpio: protect the descriptor label with SRCU"). I assume it's caused by the wait imposed by synchronize_srcu() in desc_set_label(). In both cases the drop in performance is caused by the fact that the user-space code in question does something like this (treat it as pseudocode as in both cases users were using libgpiod v1.6.x but the result with libgpiod v2 would be the same): for (;;) { req = gpiod_chip_request_lines(chip, ...); gpiod_line_request_get/set_value(req, ...); gpiod_line_request_release(req); } This pattern can be seen in code that implements some bitbanging protocols etc. from user-space - programs that call request/release in quick succession can indeed see a slow-down now with SRCU. Please note that this is almost certainly wrong: in general the user process should request the GPIO, keep it requested and then perform any accesses as required. What this code does, would be analogous to the following code in the kernel: for (;;) { desc = gpiod_get(dev, "foo", ...); gpiod_set_value(desc, ...); gpiod_put(desc); } Of course what drivers actually do is: call gpiod_get() once in probe() and free the descriptor with gpiod_put() in remove() and user-space should follow the same pattern. While I could just brush it off and tell the users to fix their code as the libgpiod test-suite which executes a more realistic set of operations actually runs slightly faster after the recent rework, I assume I'll be getting more reports like this so I want to look into it and see if something can be done. It turns out that performance stays the same if the thread running the above code is pinned to a single CPU (for example: using pthread_setaffinity_np()). Is this normal for SRCU to wait for much longer if this is not the case? I don't know enough to understand why this is the case. The offending kernel code looks like this: old = rcu_replace_pointer(desc->label, new, 1); synchronize_srcu(&desc->srcu); kfree_const(old); I was wondering if we even have to synchronize here? The corresponding read-only sections call srcu_dereference(desc->label, &desc->srcu). Would it be enough to implement kfree_const_rcu() and use it here without synchronizing? Would the read-only users correctly see that last dereferenced address still points to valid memory until they all release the lock and the memory would only then be freed? Is my understanding of kfree_rcu() correct? Bart