Re: [PROBLEM] spi driver internal error during boot on sparx5

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 1 Sep 2022 13:42:19 +0200
Vincent Whitchurch <vincent.whitchurch@xxxxxxxx> wrote:

> On Thu, Sep 01, 2022 at 01:02:22PM +0200, David Jander wrote:
> > On Thu, 1 Sep 2022 08:57:37 +0200
> > Vincent Whitchurch <vincent.whitchurch@xxxxxxxx> wrote:  
> > > You can reproduce the problem with:
> > > 
> > >  git remote add vwax https://github.com/vwax/linux.git
> > >  git fetch vwax
> > >  git archive vwax/roadtest/devel tools/testing/roadtest | tar xf -
> > >  make -C tools/testing/roadtest/ -j24 OPTS="-k adc --rt-timeout 10 -v"
> > > 
> > > The test passes on v5.19 but on current mainline it hangs and times out.  
> > 
> > This is very nice. Thanks.  
> 
> Great to hear that it was useful!
> 
> > I followed your instructions, and if I apply the following, all tests are
> > passed.
> > 
> > diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
> > index 83da8862b8f2..32c01e684af3 100644
> > --- a/drivers/spi/spi.c
> > +++ b/drivers/spi/spi.c
> > @@ -1727,8 +1727,7 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
> >         spin_unlock_irqrestore(&ctlr->queue_lock, flags);
> >  
> >         ret = __spi_pump_transfer_message(ctlr, msg, was_busy);
> > -       if (!ret)
> > -               kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> > +       kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);
> >  
> >         ctlr->cur_msg = NULL;
> >         ctlr->fallback = false;
> > 
> > The problem is that if __spi_pump_transfer_message() fails, the ctlr->busy
> > flag is left true, so __spi_async() is not going to queue new work. The busy
> > transition is handled right above that piece of code, in
> > __spi_pump_transfer_message(), and the mechanism is to queue more work to do
> > it. Apparently work was only queued when the transfer was successful, and I am
> > not sure why it was like that. Queuing work unconditionally solves the issue
> > and should not be a problem.
> > 
> > Curious thing is, that this change alone is sufficient to make all the
> > roadtest tests pass. But I still think that does not fix the bug reported by
> > Casper. For that, Mark's patch is also necessary.  
> 
> Yes, I noticed that too.  If you comment out the two usages of
> "@flaky_bus" in tools/testing/roadtest/roadtest/tests/iio/adc/test_adc084s021.py then
> the bus error injection will be disabled and only the success paths will
> be tested.  If you do that and test ae7d2346dc89ae89a6e0 ("spi: Don't
> use the message queue if possible in spi_sync") with the crash fix and
> iio sysfs fixes cherry-picked in, you'll see the following failure
> without Mark's patch:
> 
>  FAILED roadtest/tests/iio/adc/test_adc084s021.py::test_illuminance - BlockingIOError: [Errno 115] Operation now in progress
> 
> But if you move forward to 69fa95905d40846756d22 ("spi: Ensure the
> io_mutex is held until spi_finalize_current_message()"), then the tests
> start passing again, if they're run without the error injection.
> 
> So 69fa95905d40846756d22 seems to be masking the original problem from
> ae7d2346dc89ae89a6e0 (while at the same time introducing the other
> problem in the error handling).

The following patch set should be viewed as one single change:

dc3029056b02 spi: opportunistically skip ctlr->cur_msg_completion
69fa95905d40 spi: Ensure the io_mutex is held until spi_finalize_current_message()
72c5c59b659d spi: Set ctlr->cur_msg also in the sync transfer case
1a9cafcb57b7 spi: Remove unneeded READ_ONCE for ctlr->busy flag
66a221593cb2 spi: Remove the now unused ctlr->idling flag
049d6ccc4da8 spi: Remove check for idling in __spi_pump_messages()
d5256cce1f50 spi: Remove check for controller idling in spi sync path
8711a2ab51dd spi: __spi_pump_messages: Consolidate spin_unlocks to goto target
c1038165fbbf spi: Lock controller idling transition inside the io_mutex
ae7d2346dc89 spi: Don't use the message queue if possible in spi_sync
1714582a3a08 spi: Move ctlr->cur_msg_prepared to struct spi_message

I left the individual steps, in order to make it easier to view how the code
mutated, but the individual steps have had very little testing in between, and
some have known problems in corner-cases.
Particularly patch 69fa95905d40 does fix support for controllers that do not
call spi_finalize_current_message() in the same context as transfer_one(), for
example from IRQ, or in the case of the mux from a completion. If you replace
spi_async() with spi_sync() in spi-mux.c you will accomplish the same.

Best regards,

-- 
David Jander



[Index of Archives]     [Linux Kernel]     [Linux ARM (vger)]     [Linux ARM MSM]     [Linux Omap]     [Linux Arm]     [Linux Tegra]     [Fedora ARM]     [Linux for Samsung SOC]     [eCos]     [Linux Fastboot]     [Gcc Help]     [Git]     [DCCP]     [IETF Announce]     [Security]     [Linux MIPS]     [Yosemite Campsites]

  Powered by Linux