Re: [PATCH V2 0/7] mmc: Several fixes for bcm2835 driver

Michal Suchánek <msuchanek@xxxxxxx> · Thu, 28 Mar 2019 21:43:51 +0100

On Fri, 22 Mar 2019 18:10:11 +0100 (CET)
Stefan Wahren <stefan.wahren@xxxxxxxx> wrote:

> Hi Michal,
> 
> > Michal Suchánek <msuchanek@xxxxxxx> wrote on 22 March 2019 at 17:06:
> > 
> > 
> > On Fri, 22 Mar 2019 15:45:13 +0100
> > Stefan Wahren <stefan.wahren@xxxxxxxx> wrote:
> >   
> > > Hi Michal,
> > > 
> > > On 21.03.19 at 21:03, Michal Suchánek wrote:
> > > 
> > > could you please retry with mainline kernel 5.0?  
> > 
> > I can try that. What I have is pretty much 5.0 anyway so I don't expect
> > much difference:
> >   
> 
> as long as the issue lies in the sdhost driver code. There have also been a lot of fixes by Lukas Wunner to the DMA engine driver. I prefer a well-defined source base.
> 
> > 
> > I suspect that one of the locking fixes that went into mainline
> > recently prevents recovering from the error but I did not try
> > reverting them yet.  
> 
> It would be nice if you could find this regression.

Does not look that good.

# bad: [a48caea1745f30e87ab5a8dba5e365d0346aa600] mmc: bcm2835: Drop DMA channel error pointer check (bsc#983145).
# good: [c6b26547caa816608ea5c5717b29c78769a22972] mmc: bcm2835: reset host on timeout (bsc#983145, bsc#1070872).
git bisect start 'a48caea1745f30e87ab5a8dba5e365d0346aa600' 'c6b26547caa816608ea5c5717b29c78769a22972'
# good: [3eb1fe752f52865eff0f9b8edd95b61b6a9c1010] mmc: bcm2835: Terminate timeout work synchronously (bsc#983145, bsc#1070872).
git bisect good 3eb1fe752f52865eff0f9b8edd95b61b6a9c1010
# good: [50b4dd03bd11fbb647f0edbed8501290f6f9ea46] mmc: bcm2835: Properly handle dmaengine_prep_slave_sg (bsc#983145).
git bisect good 50b4dd03bd11fbb647f0edbed8501290f6f9ea46

On the good commits a few timeouts occur and are recovered from. This leaves

8c9620b1cc9b69e82fa8d4081d646d0016b602e7  mmc: bcm2835: Fix DMA channel leak on probe error
e5c1e63c932379b89d7404d4e5fde1bf8abff951  mmc: bcm2835: Drop DMA channel error pointer check

which are not overly suspect. The latter, at least, should be a no-op. So
maybe the indefinite lockup depends on card state or some other factor.
Also, when the system does recover, the recovery takes quite long
(on the order of minutes), with the system almost completely
unresponsive in the meantime. Not waiting long enough for the system to
recover might also be a factor.
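
To separate a slow recovery from a true lockup, each test run could be
given a fixed, generous deadline before it is declared hung. A minimal
Python sketch under that assumption; wait_for_recovery and the probe
callback are hypothetical names, not part of the driver or of any
existing test setup:

```python
import time
from typing import Callable, Optional


def wait_for_recovery(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it succeeds or deadline_s has elapsed.

    Returns the seconds elapsed at the first successful probe, or None
    if the deadline passed without success (treated as a lockup).
    probe is a hypothetical callback, e.g. something that performs a
    small read from the SD card and reports whether it completed.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

Recording the returned time for every run would also show whether the
"good" commits really recover faster or whether they just happened to be
given more time.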

> 
> > It takes quite a while to reimage, boot a different
> > kernel, run the system update (which now invariably crashes/locks up with
> > any kernel).   
> 
> That's why I prefer Raspbian during development, where I can simply replace the kernel / modules with an SD card reader on a PC:
> 
> https://gist.github.com/lategoodbye/c7317a42bf7f9c07f5a91baed8c68f75
> 
> This should reduce the round-trip time, but it could accidentally hide the problem as well.

I can replace the kernel easily, but without going back to the obsolete
system state I lose the system upgrade as a test case.

Thanks 

Michal