Performance regression in GPIOLIB with SRCU when using the user-space ABI in a wrong way

Bartosz Golaszewski <brgl@xxxxxxxx> · Mon, 6 May 2024 14:32:57 +0200

Hi Paul et al,

I have recently received two independent reports of a performance
regression in the GPIO character device. In both cases it was bisected
to commit 1f2bcb8c8ccd ("gpio: protect the descriptor label with
SRCU"). I assume it's caused by the wait imposed by synchronize_srcu()
in desc_set_label().

In both cases the drop in performance is caused by the fact that the
user-space code in question does something like this (treat it as
pseudocode as in both cases users were using libgpiod v1.6.x but the
result with libgpiod v2 would be the same):

    for (;;) {
        req = gpiod_chip_request_lines(chip, ...);
        gpiod_line_request_get/set_value(req, ...);
        gpiod_line_request_release(req);
    }

This pattern can be seen in code that implements some bitbanging
protocols etc. from user-space - programs that call request/release in
quick succession can indeed see a slow-down now with SRCU.

Please note that this is almost certainly wrong: in general the user
process should request the GPIO, keep it requested and then perform
any accesses as required. What this code does, would be analogous to
the following code in the kernel:

    for (;;) {
        desc = gpiod_get(dev, "foo", ...);
        gpiod_set_value(desc, ...);
        gpiod_put(desc);
    }

Of course what drivers actually do is: call gpiod_get() once in
probe() and free the descriptor with gpiod_put() in remove() and
user-space should follow the same pattern.

While I could just brush it off and tell the users to fix their code
as the libgpiod test-suite which executes a more realistic set of
operations actually runs slightly faster after the recent rework, I
assume I'll be getting more reports like this so I want to look into
it and see if something can be done.

It turns out that performance stays the same if the thread running the
above code is pinned to a single CPU (for example: using
pthread_setaffinity_np()). Is this normal for SRCU to wait for much
longer if this is not the case? I don't know enough to understand why
this is the case.

The offending kernel code looks like this:

    old = rcu_replace_pointer(desc->label, new, 1);
    synchronize_srcu(&desc->srcu);
    kfree_const(old);

I was wondering if we even have to synchronize here? The corresponding
read-only sections call srcu_dereference(desc->label, &desc->srcu).
Would it be enough to implement kfree_const_rcu() and use it here
without synchronizing? Would the read-only users correctly see that
last dereferenced address still points to valid memory until they all
release the lock and the memory would only then be freed? Is my
understanding of kfree_rcu() correct?

Bart

Performance regression in GPIOLIB with SRCU when using the user-space ABI in a *wrong* way

Performance regression in GPIOLIB with SRCU when using the user-space ABI in a wrong way