Hi Dirk, hope you are doing well! Thanks for doing this analysis and summary! Much appreciated. > Comparing this, the questions are: > > * tmio uses irqsave spin locks, while the bcm2835 uses mutex for locking. > Why is irqsave spin lock used here? The spinlock was introduced with 6ff56e0d8e02 ("mmc: tmio_mmc: handle missing HW interrupts"). This was before my time with Renesas, but the rule of thumb for chosing spinlocks over mutexes: they are lightweight. If the critical sections are short and no sleeping is needed within them, they can save context switching. Of course, this is not needed for a timeout handler, but in other code paths this is more interesting. > * In tmio the mmc_request_done() is *outside* the lock, while for bcm2835 > its *inside* the lock protected section. Why does tmio doesn't lock the > access to mmc_request_done()? The spinlock is to protect the private data structure, 'host' in our case. mmc_request_done() doesn't operate on it, only with MMC core structures 'mmc_host' and 'mmc_request'. AFAICS that doesn't need protection. The mtk-sd driver has it also outside the critical section. But: clearing host->mrq looks like it should be *in* the critical section. There is another case in tmio_mmc_set_ios() which also looks suspicious at least. I'll double check all this tomorrow. > * How does tmio_mmc_reset_work() ensures that the content mrq points to and > passed to mmc_request_done() is still the correct content and hasn't changed > asynchronously in the background (due to being outside the lock)? Because it copies 'host->mrq' to a local variable 'mrq' within a critical section: 236 spin_lock_irqsave(&host->lock, flags); 237 mrq = host->mrq; and then passes this to mmc_request_done(). It even clears host->mrq before that to enable asynchronous processing of the next mrq: 270 host->mrq = NULL; 271 mmc_request_done(host->mmc, mrq); > * Why does tmio doesn't use cancel_delayed_work()? Maybe because its ensured > by the interrupt locking no further work is scheduled? If you want to cancel it, I think you should rather use cancel_delayed_work_sync() to really know that there is nothing pending anymore. But that function sleeps. TMIO approach is to simply bail out on later calls: 239 /* 240 * is request already finished? Since we use a non-blocking 241 * cancel_delayed_work(), it can happen, that a .set_ios() call preempts 242 * us, so, have to check for IS_ERR(host->mrq) 243 */ 244 if (IS_ERR_OR_NULL(mrq) || 245 time_is_after_jiffies(host->last_req_ts + 246 msecs_to_jiffies(CMDREQ_TIMEOUT))) { 247 spin_unlock_irqrestore(&host->lock, flags); 248 return; 249 } Other drivers do this as well, like alcor: 318 319 /* 320 * If this work gets rescheduled while running, it will 321 * be run again afterwards but without any active request. 322 */ 323 if (!host->mrq) 324 return; 325 I'd think this part is OK but I will double check as well.^ > In sum looking at the KASAN report and the locking in tmio_mmc_reset_work() > we have an uncomfortable feeling. bcm2835_timeout() looks much safer > regarding the locking. Of course that is just a feeling and can't be proven. Yes, with a mutex and putting everything inside it, it uses a big hammer and no questions asked. I am not complaining here, that is a sensible approach at times. We have, however, historically been introduced with spinlocks and a more finegrained approach. I prefer to (and will do) double check the critical sections first before considering more intrusive approaches which would need a lot of testing, of course. > Therefore we want to ask here :) I hope I could help so far. I am open for discussion, of course! All the best, Wolfram
Attachment:
signature.asc
Description: PGP signature