Seungwon, On Mon, May 12, 2014 at 9:52 PM, Seungwon Jeon <tgih.jun@xxxxxxxxxxx> wrote: > Hi Doug, > > On Tue, May 13, 2014, Doug Anderson wrote: >> Seungwon, >> >> On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@xxxxxxxxxxx> wrote: >> > On Fri, May 09, 2014, Sonny Rao wrote: >> >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@xxxxxxxxx> wrote: >> >> > Any comments on this patch? >> >> > >> >> >> >> I'll just add that without this fix, running the tuning loop for UHS >> >> modes is not reliable on dw_mmc because errors will happen and you >> >> will eventually hit this race and hang. This can happen any time >> >> there is tuning like during boot or during resume from suspend. >> >> >> >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D >> >> > <yuvaraj.cd@xxxxxxxxx> wrote: >> >> >> From: Doug Anderson <dianders@xxxxxxxxxxxx> >> >> >> >> >> >> If we happened to get a data error at just the wrong time the dw_mmc >> >> >> driver could get into a state where it would never complete its >> >> >> request. That would leave the caller just hanging there. >> >> >> >> >> >> We fix this two ways and both of the two fixes on their own appear to >> >> >> fix the problems we've seen: >> >> >> >> >> >> 1. Fix a race in the tasklet where the interrupt setting the data >> >> >> error happens _just after_ we check for it, then we get a >> >> >> EVENT_XFER_COMPLETE. We fix this by repeating a bit of code. >> > I think repeating is not good approach to fix race. >> > In your case, XFER_COMPLETE preceded data error and DTO didn't come? >> > It seems strange case. >> > I want to know actual error value if you can reproduce. >> >> XFER_COMPLETE didn't necessarily precede data error. Imagine this scenario: >> >> 1. Check for data error: nope >> 2. Interrupt happens and we get a data error and immediately xfer complete >> 3. Check for xfer complete: yup >> >> That's the state that we are handling. >> >> The system that dw_mmc uses where the interrupt handler has no locking >> makes it incredibly difficult to get things right. Can you propose an >> alternate fix that would avoid the race? > Thank you for detailed scenario. > You're right. > Have you consider using spin_lock() in interrupt handler? > Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func. > And other locks in driver may need to be adjusted properly. I have certainly considered it and I think it's the right way to go, but I believe that this would be a pretty massive change to the design of dw_mmc. Someone appeared to try very hard not to use a spinlock in the interrupt handler and came up with the whole tasklet / pending events / completed events to deal with it. In this particular case it would be pretty easy to just add the spinlock around the data error / xfer complete checks, though once we have a spinlock here it seems like we'd want to start using it in other places. It would be confusing if the interrupt handler grabbed a spinlock the whole time but then only used it to protect a single small check. I'm really curious, though, why this driver can't just use a threaded irq handler and eliminate the whole interrupt / tasklet split. That seems saner in the long run. > To return above scenario: > 1. Check for data error: nope > 2. Check for xfer complete: nope -> escape tasklet. > 3. Interrupt happens and we get a data error and immediately xfer complete > 4. Check for data error (Again in tasklet) : yup > > How about this change? I'm not sure I understand. Are you suggesting a change to my code, or wondering how my code handles the above scenario? I think your scenario works find either with or without my patch. -Doug -- To unsubscribe from this list: send the line "unsubscribe linux-mmc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html