Re: [BUG] gpiolib: cdev: can't read RELEASED event for last line

Kent Gibson <warthog618@xxxxxxxxx> · Fri, 26 May 2023 14:51:04 +0800

On Fri, May 26, 2023 at 08:45:38AM +0800, Kent Gibson wrote:
> On Thu, May 25, 2023 at 09:21:24PM +0800, Kent Gibson wrote:
> > On Thu, May 25, 2023 at 03:46:12PM +0800, Kent Gibson wrote:
> > > On Thu, May 25, 2023 at 03:09:26PM +0800, Kent Gibson wrote:
> > > > On Thu, May 25, 2023 at 11:09:52AM +0800, Kent Gibson wrote:
> > > > > Hi Bart,
> > > > > 
> > > > I can also confirm that receiving the event using a blocking read() on the
> > > > fd still works; it is a poll() on the fd followed by a read() that fails.
> > > > 
> > > 
> > > Hmmm, so it occurred to me that gpionotify does the poll()/read(), so it
> > > should exhibit the bug.  But no, it doesn't.
> > > 
> > > So it could be my code doing something boneheaded??
> > > Or there is some other variable at play.
> > > I'll try to write a test for it with libgpiod and see if I can reproduce
> > > it.  But I might put it on the back burner - this one isn't terribly
> > > high priority.
> > > 
> > 
> > Bisect result:
> > 
> > [bdbbae241a04f387ba910b8609f95fad5f1470c7] gpiolib: protect the GPIO device against being dropped while in use by user-space
> > 
> > So, the semaphores patch.
> > The Rust test gets the timings right to hit a race/order of events issue?
> > 
> 
> Well that throws some new light on the problem.
> One of the differentiators of the Rust test from the other ways I was
> trying to reproduce the problem is that the Rust test is multithreaded.
> 
> This being semaphore related makes other weirdness I was seeing make
> sense.
> The original form of that test was locking up when the bg thread
> released the line.  The drop never returned and the fg thread never
> received a RELEASED event.  That wasn't a problem with the drop, or an
> event being lost, as I assumed - that was a DEADLOCK.
> 
> The current form of the test uses message passing to coordinate the
> threads, so that deadlock condition doesn't occur any more.
> (That change was because of my concern that the lack of buffering of info
> changed events in the kernel could result in lost events - and the goal of
> the test was to test my iterator, not the kernel...)
> 
> I'll revert that test case to see if I can reproduce the deadlock case.
> 
> But it looks like those semaphores have problems. At least one path
> might lead to deadlock and another leads to an inconsistent state.
> 

I have failed to reproduce the deadlock case, which is good news, I
guess.  Not sure what the issue was then, but not seeing it now.
I was dealing with kernel crashes at the time, so perhaps my kernel was
in a weird state?

Anyway, I've split the problem ENODEV test case out into a new repo[1]
so it is easier to play with.  In the longer term the idea is to create a
regression test suite for the uAPI.

You can run the tests by checking out the repo and calling "cargo test".

At the moment there are two tests in there - the enodev, and the
"deadlock" (well the test that was hanging on me, but now isn't).
The deadlock test is just for reference.

I've reduced the enodev case as much as I can to try to identify the
root cause.

It could be related to the event queue somehow, as it passes if it
doesn't wait for the bg thread to complete and instead reads the events
as they arrive.  So when it fails the two threads are not running
concurrently - the bg thread has finished before the fg thread reads.

It also passes if the request/drop is performed in the same thread.
So it requires the bg thread.
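
To make the two orderings concrete, here is a rough sketch of their shape.
It is NOT the test from the repo and does not touch the uAPI: a std mpsc
channel stands in for the kernel's event queue, and InfoEvent,
read_after_join() and read_while_running() are hypothetical names made up
for illustration.  With a plain channel both orderings deliver both
events, which is what I would expect from the chardev as well.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a line info changed event.
#[derive(Clone, Copy, Debug, PartialEq)]
enum InfoEvent {
    Requested,
    Released,
}

/// Shape of the failing case: the bg thread requests and drops the line,
/// the fg thread waits for it to complete, and only then reads the events.
fn read_after_join() -> Vec<InfoEvent> {
    let (tx, rx) = mpsc::channel();
    let bg = thread::spawn(move || {
        tx.send(InfoEvent::Requested).unwrap(); // stands in for the request
        tx.send(InfoEvent::Released).unwrap(); // stands in for the drop
    });
    bg.join().unwrap();
    // Both events are already queued, so a non-blocking drain sees them.
    rx.try_iter().collect()
}

/// Shape of the passing case: the fg thread reads events as they arrive,
/// while the bg thread is still running.
fn read_while_running() -> Vec<InfoEvent> {
    let (tx, rx) = mpsc::channel();
    let bg = thread::spawn(move || {
        tx.send(InfoEvent::Requested).unwrap();
        tx.send(InfoEvent::Released).unwrap();
    });
    // Blocks for each event; ends once the bg thread drops the sender.
    let events: Vec<InfoEvent> = rx.iter().collect();
    bg.join().unwrap();
    events
}
```

In the real test the channel is replaced by the chip fd, and it is the
read_after_join() ordering that gets ENODEV instead of the RELEASED event.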

Not sure if that helps any, but that is where I'm at - still puzzled.

Cheers,
Kent.

[1] https://github.com/warthog618/gurt-rs