On Mon, May 06, 2024 at 02:32:57PM +0200, Bartosz Golaszewski wrote: > Hi Paul et al, > > I have recently received two independent reports of a performance > regression in the GPIO character device. In both cases it was bisected > to commit 1f2bcb8c8ccd ("gpio: protect the descriptor label with > SRCU"). I assume it's caused by the wait imposed by synchronize_srcu() > in desc_set_label(). > > In both cases the drop in performance is caused by the fact that the > user-space code in question does something like this (treat it as > pseudocode as in both cases users were using libgpiod v1.6.x but the > result with libgpiod v2 would be the same): > > for (;;) { > req = gpiod_chip_request_lines(chip, ...); > gpiod_line_request_get/set_value(req, ...); > gpiod_line_request_release(req); > } > > This pattern can be seen in code that implements some bitbanging > protocols etc. from user-space - programs that call request/release in > quick succession can indeed see a slow-down now with SRCU. > > Please note that this is almost certainly wrong: in general the user > process should request the GPIO, keep it requested and then perform > any accesses as required. What this code does, would be analogous to > the following code in the kernel: > > for (;;) { > desc = gpiod_get(dev, "foo", ...); > gpiod_set_value(desc, ...); > gpiod_put(desc); > } > > Of course what drivers actually do is: call gpiod_get() once in > probe() and free the descriptor with gpiod_put() in remove() and > user-space should follow the same pattern. > > While I could just brush it off and tell the users to fix their code > as the libgpiod test-suite which executes a more realistic set of > operations actually runs slightly faster after the recent rework, I > assume I'll be getting more reports like this so I want to look into > it and see if something can be done. > > It turns out that performance stays the same if the thread running the > above code is pinned to a single CPU (for example: using > pthread_setaffinity_np()). Is this normal for SRCU to wait for much > longer if this is not the case? I don't know enough to understand why > this is the case. > > The offending kernel code looks like this: > > old = rcu_replace_pointer(desc->label, new, 1); > synchronize_srcu(&desc->srcu); > kfree_const(old); > > I was wondering if we even have to synchronize here? The corresponding > read-only sections call srcu_dereference(desc->label, &desc->srcu). > Would it be enough to implement kfree_const_rcu() and use it here > without synchronizing? Would the read-only users correctly see that > last dereferenced address still points to valid memory until they all > release the lock and the memory would only then be freed? Is my > understanding of kfree_rcu() correct? It looks like kfree_const() just does a kfree(), so why not use call_srcu() to replace the above calls to synchronize_srcu() and kfree_const()? Something like this: if (!is_kernel_rodata((unsigned long)(old))) call_srcu(&desc->srcu, &desc->rh, gpio_cb); This requires adding an rcu_head field named "rh" to the structure referenced by "desc" and creating a gpio_cb() wrapper function: static void connection_release(struct rcu_head *rhp) { struct beats_me *bmp = container_of(rhp, struct beats_me, rh); kfree(bmp); } I could not find the code, so I don't know what "beats_me" above should be replaced with. Would that work? Thanx, Paul