Hi Dirk, > > With the to-be-fixed commit, the reset_work handler cleared 'host->mrq' > > outside of the spinlock protected critical section. That leaves a small > > race window during execution of 'tmio_mmc_reset()' where the done_work > > handler could grab a pointer to the now invalid 'host->mrq'. Both would > > use it to call mmc_request_done() causing problems (see Link). > > > > However, 'host->mrq' cannot simply be cleared earlier inside the > > critical section. That would allow new mrqs to come in asynchronously > > while the actual reset of the controller still needs to be done. So, > > like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from > > coming in but still avoiding concurrency between work handlers. > > > > Reported-by: Dirk Behme <dirk.behme@xxxxxxxxxxxx> > > Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@xxxxxxxxxxxx/ > > Signed-off-by: Wolfram Sang <wsa+renesas@xxxxxxxxxxxxxxxxxxxx> > > Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race") > > Tested-by: Dirk Behme <dirk.behme@xxxxxxxxxxxx> > Reviewed-by: Dirk Behme <dirk.behme@xxxxxxxxxxxx> Awesome! Thanks for the super-fast tags! > At least the issues we observed before are not seen any more. As we are not > exactly sure on the root cause, of course this is not a 100% proof. But as > the change looks good, looks like it won't break something and the system > behaves good with it I would say we are good to go. I agree. We don't know if it is all you need. But there definitely was a race window and closing it removes some observed anomalies. Let's hope all of them :) I looked many times at the code and, to the best of my knowledge, don't see side effects. 'host->mrq' stays non-NULL, so new mrqs won't be added like before. Changing it to an ERR_PTR will only affect the check in the done_work handler which is what we want. But, of course, more eyes are always welcome. > I think we could add anything like > > Cc: stable@xxxxxxxxxxxxxxx # 3.0+ Yes, we should definitely have that. I would have added it once your testing got good results. This affects every Renesas SDHI or Uniphier SD instance since 3.0 (12 years). Wow! So, thanks a ton for your report and assistance in debugging it. Very much appreciated! And, phew, I am happy that this solution does not make the locking more complex \o/ All the best, Wolfram
Attachment:
signature.asc
Description: PGP signature