Re: Performance regression in GPIOLIB with SRCU when using the user-space ABI in a *wrong* way

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, May 06, 2024 at 02:32:57PM +0200, Bartosz Golaszewski wrote:
> Hi Paul et al,
> 
> I have recently received two independent reports of a performance
> regression in the GPIO character device. In both cases it was bisected
> to commit 1f2bcb8c8ccd ("gpio: protect the descriptor label with
> SRCU"). I assume it's caused by the wait imposed by synchronize_srcu()
> in desc_set_label().
> 
> In both cases the drop in performance is caused by the fact that the
> user-space code in question does something like this (treat it as
> pseudocode as in both cases users were using libgpiod v1.6.x but the
> result with libgpiod v2 would be the same):
> 
>     for (;;) {
>         req = gpiod_chip_request_lines(chip, ...);
>         gpiod_line_request_get/set_value(req, ...);
>         gpiod_line_request_release(req);
>     }
> 
> This pattern can be seen in code that implements some bitbanging
> protocols etc. from user-space - programs that call request/release in
> quick succession can indeed see a slow-down now with SRCU.
> 
> Please note that this is almost certainly wrong: in general the user
> process should request the GPIO, keep it requested and then perform
> any accesses as required. What this code does, would be analogous to
> the following code in the kernel:
> 
>     for (;;) {
>         desc = gpiod_get(dev, "foo", ...);
>         gpiod_set_value(desc, ...);
>         gpiod_put(desc);
>     }
> 
> Of course what drivers actually do is: call gpiod_get() once in
> probe() and free the descriptor with gpiod_put() in remove() and
> user-space should follow the same pattern.
> 
> While I could just brush it off and tell the users to fix their code
> as the libgpiod test-suite which executes a more realistic set of
> operations actually runs slightly faster after the recent rework, I
> assume I'll be getting more reports like this so I want to look into
> it and see if something can be done.
> 
> It turns out that performance stays the same if the thread running the
> above code is pinned to a single CPU (for example: using
> pthread_setaffinity_np()). Is this normal for SRCU to wait for much
> longer if this is not the case? I don't know enough to understand why
> this is the case.
> 
> The offending kernel code looks like this:
> 
>     old = rcu_replace_pointer(desc->label, new, 1);
>     synchronize_srcu(&desc->srcu);
>     kfree_const(old);
> 
> I was wondering if we even have to synchronize here? The corresponding
> read-only sections call srcu_dereference(desc->label, &desc->srcu).
> Would it be enough to implement kfree_const_rcu() and use it here
> without synchronizing? Would the read-only users correctly see that
> last dereferenced address still points to valid memory until they all
> release the lock and the memory would only then be freed? Is my
> understanding of kfree_rcu() correct?

It looks like kfree_const() just does a kfree(), so why not use
call_srcu() to replace the above calls to synchronize_srcu() and
kfree_const()?

Something like this:

	if (!is_kernel_rodata((unsigned long)(old)))
		call_srcu(&desc->srcu, &desc->rh, gpio_cb);

This requires adding an rcu_head field named "rh" to the structure
referenced by "desc" and creating a gpio_cb() wrapper function:

static void connection_release(struct rcu_head *rhp)
{
	struct beats_me *bmp = container_of(rhp, struct beats_me, rh);

	kfree(bmp);
}

I could not find the code, so I don't know what "beats_me" above
should be replaced with.

Would that work?

							Thanx, Paul




[Index of Archives]     [Linux SPI]     [Linux Kernel]     [Linux ARM (vger)]     [Linux ARM MSM]     [Linux Omap]     [Linux Arm]     [Linux Tegra]     [Fedora ARM]     [Linux for Samsung SOC]     [eCos]     [Linux Fastboot]     [Gcc Help]     [Git]     [DCCP]     [IETF Announce]     [Security]     [Linux MIPS]     [Yosemite Campsites]

  Powered by Linux