Re: GPIOLIB locking is broken and how to fix it

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Nov 25, 2023 at 09:13:22PM +0100, Bartosz Golaszewski wrote:
> On Sat, Nov 25, 2023 at 2:29 AM Kent Gibson <warthog618@xxxxxxxxx> wrote:
> >
> > On Fri, Nov 24, 2023 at 05:00:36PM +0100, Bartosz Golaszewski wrote:
> > > Hi!
> > >

> >
> > Generally I would advocate for not holding locks over callbacks.
> > If the callback requires locks then it (or the code it calls) should
> > request them itself.
> > GPIOLIB just needs to ensure the desc isn't freed during the callback.
> >
>
> That is quite tricky though, isn't it?

Well it is kernel code, so yeah.
You were expecting it to be easy ;-)?

> Let's consider our request/free
> example. Even if we start gpiod_request_commit() by calling
> gc->request() and then acquire the gpio_desc lock and modify the
> descriptor, we may have a concurrent gpiod_free_commit() that will do
> the same in reverse order. Even if the driver in question uses some
> locking, we'll have time between when it releases the lock and when we
> acquire the desc lock (or vice-versa for free). We can now end up in a
> situation when we call gc->request(), return from it,
> gpiod_free_commit() is called, acquires desc lock before
> gpiod_request_commit() marking the descriptor as freed, we then
> acquire the lock in gpiod_request_commit() and start marking the
> descriptor as requested while gc->free() is called. We cannot know the
> driver-specific state even if we wanted to re-check it within the
> critical section.
>

So you use transitional states - not just requested/freed.
Before the gc->request() call gpiod_request_commit() marks the desc as
being requested (while it holds the appropriate lock of course).
If the gpiod_free_commit() comes along and sees that mark, again while
holdiing the lock, then it marks the desc as to be freed and leaves it
for gpiod_request_commit() to cleanup.
When gpiod_request_commit() re-acquires the lock after the gc->request()
call it sees the to be freed mark and performs the cleanup instead of the
commit finalisation.

I'm generalising here and not looking at code, so the specifics may be
wrong, but this is a solvable problem.

> This particular example could be a non-issue if we only dealt out
> descriptors for exclusive usage to consumers but unfortunately we
> don't. We currently have lots of places in drivers/gpio/ and - oh
> horror! - outside where users just shamelessly access the descriptor
> array in a GPIO device and fiddle with its internals so this may
> happen. That's another problem. But as we read and/or modify the
> descriptors in almost all GPIO callbacks, this isn't the only use-case
> that can lead to undefined behavior.
>

That is a very different problem - another breach of contract, or lack
of contract. And a problem that can't be solved by changing the locks.

> I'd say that driver-specific locking should be used whenever the
> driver resources are shared between different subsystems. Some drivers
> expose a GPIO chip but also an I2C expander or a PWM chip and they all
> share resources. Then we also have power-management callbacks and
> whatnot.
>
> For GPIOLIB callbacks exclusively, I think we should have a hard
> contract for concurrency. Which brings me neatly to...
>
> > The contract for the driver callbacks is not clear to me. e.g. can they
> > be called concurrently (e.g. different cdev requests trying to set
> > different lines and so both calling gc->set())
>
> AFAIK there is no contract. None is documented and - with GPIOLIB's
> current state - no implicit contract is enforced either.
>

And that is the actual problem here - the lack of contracts.

> > If so, are the drivers aware of that?
> > If not, a mutex over the callbacks would seem very appropriate to
> > enforce that contract, independent of protecting the desc.
>
> Not a mutex. Not exclusively anyway. I suppose we could go with an
> abstraction layer like what regmap does. If the chip is marked as
> can_sleep then we use mutex, if not, we use spinlock. Possibly have a
> flag allowing users to disable locking if they are certain concurrency
> is not an issue.
>
> > (though ideally the drivers are be aware that they need to provide
> > their own locking as appropriate)
> >
>
> We should put a contract in place saying: if your driver is certain to
> not share any resources with any other subsystem: don't use your own
> locking. Otherwise, protect only the shared structures.
>

This is why I advocate for not holding any locks during the callbacks -
then there are no restrictions imposed on the driver and no way it can
inadvertantly deadlock.

> > > One idea I have is introducing a strict limit on which functions can
> > > be used from atomic context (we don't enforce anything ATM in
> > > functions that don't have the _cansleep suffix in their names) and
> > > check which parts of the descriptor struct they modify. Then protect
> > > these parts with a spinlock in very limited critical sections. Have a
> > > mutex for everything else that can only be accessed from process
> > > context.
> > >
> >
> > ... and needs to call cansleep within the critical section itself.
> >
> > > Another one is introducing strict APIs like gpiod_set_value_atomic()
> > > that'll be designed to be called from atomic context exclusively and
> > > be able to handle it. Everything else must only be called from process
> > > context. This of course would be a treewide change as we'd need to
> > > modify all GPIO calls in interrupt handlers.
> > >
> > > I'd like to hear your ideas as this change is vital before we start
> > > protecting gdev->chip with SRCU in all API calls.
> > >
> >
> > As above, I'm not clear on the driver callback contract, so there is a
> > good chance anything more specific I have to add would be incoherent.
> >
> > I do have concerns that adding or changing locking could inadvertantly
> > constrain otherwise valid concurrency, but safe is better than fast.
> > But safe AND fast is ideal.
>
> This is why my ultimate goal is to protect the gpio_chip pointer in
> gpio_device against sudden disappearance of the provider with SRCU -
> possibly the fastest sleepable locking mechanism we have in the
> kernel.
>
> SRCU could also be used to check if a desc is requested before
> entering the proper API call in the ideal world where we don't allow
> using non-requested descriptors making it possible to avoid real locks
> in struct gpio_desc.
>

So why allow the usage of non-requested descriptors?

Cheers,
Kent.





[Index of Archives]     [Linux SPI]     [Linux Kernel]     [Linux ARM (vger)]     [Linux ARM MSM]     [Linux Omap]     [Linux Arm]     [Linux Tegra]     [Fedora ARM]     [Linux for Samsung SOC]     [eCos]     [Linux Fastboot]     [Gcc Help]     [Git]     [DCCP]     [IETF Announce]     [Security]     [Linux MIPS]     [Yosemite Campsites]

  Powered by Linux