Re: GPIOLIB locking is broken and how to fix it

Linus Walleij <linus.walleij@xxxxxxxxxx> · Sat, 25 Nov 2023 00:20:49 +0100

On Fri, Nov 24, 2023 at 5:00 PM Bartosz Golaszewski <brgl@xxxxxxxx> wrote:

> There are more instances of this pattern. This seems to be a way to
> work around the fact that we have GPIO API functions that can be
> called from atomic context (gpiod_set/get_value(),
> gpiod_direction_input/output(), etc.) that in their implementation
> call driver callbacks that may as well sleep (gc->set(),
> gc->direction_output(), etc.).

Correct, AFAIK this is why this looks like it does.

> Protecting the list of GPIO devices is simple. It should be a mutex as
> the list should never be modified from atomic context.

If you are referring to gpio_lock then go back and look how that
got where it is today:

git checkout d2876d08d86f2 (initial gpiolib commit 2008)

The ultimately confusing thing is that:

1. Yes it is protecting the list of gpio chips
2. The struct holding items in the list of the gpio chips is
  called gpio_desc...

Then this comment:

/* gpio_lock prevents conflicts during gpio_desc[] table updates.
 * While any GPIO is requested, its gpio_chip is not removable;
 * each GPIO's "requested" flag serves as a lock and refcount.
 */

Is cargo-culted to present day and is talking about gpio_desc, but
nowadays gpio_desc means something completely different...

OK I'll send a patch just deleting this comment, it looks insane.

> This can be easily factored out right now.

JustDoIt :)

> Protecting GPIO descriptors is
> trickier. If we use a spinlock for that, we'll run into problems with
> GPIO drivers that can sleep. If we use a mutex, we'll have a problem
> with users calling GPIO functions from atomic context.
>
> One idea I have is introducing a strict limit on which functions can
> be used from atomic context (we don't enforce anything ATM in
> functions that don't have the _cansleep suffix in their names) and
> check which parts of the descriptor struct they modify. Then protect
> these parts with a spinlock in very limited critical sections. Have a
> mutex for everything else that can only be accessed from process
> context.
>
> Another one is introducing strict APIs like gpiod_set_value_atomic()
> that'll be designed to be called from atomic context exclusively and
> be able to handle it. Everything else must only be called from process
> context. This of course would be a treewide change as we'd need to
> modify all GPIO calls in interrupt handlers.

This is a much harder problem.

Many of the current API functions can be called from atomic and
nonatomic contexts alike :/ this has historical reasons of course.
Back in 2008 most GPIO chips were just on-SoC and resulted in
a register write: no problem. Now we have quite a bunch of GPIOs
on I2C, SPI ... and the API looks the same.

The _cansleep functions were supposed to be used explicitly in
places where it is OK that the GPIO can sleep (as in: I don't care
if you sleep or not), and every other site using the non-_cansleep
versions should be where it has to be atomic.

Every call where non-_cansleep is called but it doesn't matter is
essentially a bug, they should be using _cansleep versions.
(Oh boy ... such much bug.)

In 2008 when it was introduced, *_cansleep had one single user:
drivers/leds/led-gpio.c because it was assumed that users of
that driver would not care if LEDs are on GPIO expanders or
on SoC-resident GPIOs. Ironically that driver now keeps track
of whether the GPIO is sleepable or not...

So if you propose turning this on it's head by creating *_atomic
and opt-in to atomic behaviour instead of opting out of it, I'd say
yes, but only if we delete all uses of _cansleep at the same time
and that means the default behaviour becomes _cansleep.

Keeping both around at the same time is going to be
a complete mess.

Yours,
Linus Walleij