On Sun, 2009-03-29 at 01:24 -0700, Trent Piepho wrote: > On Sat, 28 Mar 2009, Andy Walls wrote: > > On Mon, 2009-03-23 at 06:52 -0700, Corey Taylor wrote: > > I found a race condition between the cx driver and the CX23418 firmware. > > I have a patch that mitigates the problem here: > > > > http://linuxtv.org/hg/~awalls/cx18/rev/9f5f44e0ce6c > > > [ We have to do this polling wait because there is a race with the > > firmware. Once we give it the SW1 interrupt above, it can wake up our > > waitq with an ack interrupt via the irq handler after we're ready to > > wait, but before we actually get put to sleep by schedule(). Loosing > > that race causes us to wait the entire timeout, waitng for a wakeup > > that's never going to come. ] Trent, First, thanks for the fresh perspective. > A race like this should be avoidable. The way it works is you do something > like this: > > /* 1 */ set_current_state(TASK_INTERRUPTIBLE); > /* 2 */ cx18_write_reg_expect(cx, irq, SW1_INT_SET, irq, irq); > /* 3 */ schedule(); > /* 4 */ ack_has_now_been_received(); I tried something like this in my second iteration, see below. (The patch I put in my repo was actually my third iteration.) > The race you are talking about is when the ack arrives between line 2 and > 3. If this happens here, the process' current state is changed to > TASK_RUNNING when the irq hander that receives the ack tries to wake our > process. If schedule() is called with the state set to TASK_RUNNING then > the process doesn't sleep. And thus there is no race. The key is that > preparing to sleep at line 1 happens before we start the event we want to > wait for at line 2. > > wait_event() should take care of this. wait_event(q, test) basically does: > > for(;;) { > // point A > add_me_to_waitqueue(q); > set_current_state(TASK_UNINTERRUPTIBLE); > if (test) > break; > // point B > schedule(); > } > clean_up_wait_stuff(); As you know, the condition is checked even before this loop is entered, to avoid even being even added to a waitqueue. (Thank God for ctags...) As you may have noticed, the original code was using wait_event_timeout() before like this: CX18_DEBUG_HI_IRQ("sending interrupt SW1: %x to send %s\n", irq, info->name); cx18_write_reg_expect(cx, irq, SW1_INT_SET, irq, irq); ret = wait_event_timeout( *waitq, cx18_readl(cx, &mb->ack) == cx18_readl(cx, &mb->request), timeout); Because waiting for the ack back is the right thing to do, but certainly waiting too long is not warranted. This gave me the occasional log message like this: 1: cx18-0: irq: sending interrupt SW1: 8 to send CX18_CPU_DE_SET_MDL 2: cx18-0: irq: received interrupts SW1: 0 SW2: 8 HW2: 0 3: cx18-0: irq: received interrupts SW1: 10000 SW2: 0 HW2: 0 4: cx18-0: warning: sending CX18_CPU_DE_SET_MDL timed out waiting 10 msecs for RPU acknowledgement Where line 1 is the driver notifiying the firmware with a SW1 interrupt. Line 2 is the firmware responding back to the cx18_irq_handler() with the Ack interrupt in SW2 (the flags match, 8 & 8, by design). Line 3 is an unrelated incoming video buffer notification for the cx18 driver. Line 4 is the wait_event_timeout() timing out. Since, I'm sending buffers back to the firmware on the read()-ing applications timeline, these delays caused playback problems. > If your event occurs and wake_up() is called at point A, then the test > should be true when it's checked and schedule() is never called. If the > event happens at point B, then the process' state will have been changed to > TASK_RUNNING by wake_up(), remember it's already on the waitqueue at this > point, and schedule() won't sleep. OK, for some reason, I thought schedule() and schedule_timeout() would go to sleep anyway. > I think what's probably happening is the test, cx18_readl(cx, &mb->ack) == > cx18_readl(cx, &mb->request), is somehow not true even though the ack has > been received. A PCI bus read error could be the culprit here. That's the only thing I can think of. We only get one notification via IRQ from the firmware. > Maybe a new request was added? No, I lock the respective epu2apu or epu2cpu mailboxes respectively with a mutex. > I think calling wait_event()'s with something that tests a hardware > register is a little iffy. It's better if the irq handler sets some driver > state flag (atomically!) that indicates the event you were waiting for has > happened and then you check that flag. I was toying with setting an atomic while in the IRQ handler. But then I realized when we get the ack interrupt, the firmware should actually be done. So really the wakeup() is the only indicator I really need. Checking for ack == req is just a formality I guess. There wasn't a wait_timeout(), so I had tried something like this in my first iteration: #define wait_event_oneshot_timeout(wq, condition, timeout) \ ({ \ long __ret = timeout; \ if (!(condition)) { \ DEFINE_WAIT(__wait); \ prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE); \ if (!(condition)) { \ __ret = schedule_timeout(__ret); \ } \ finish_wait(&wq, &__wait); \ } \ __ret; \ }) ... cx18_write_reg_expect(cx, irq, SW1_INT_SET, irq, irq); ret = wait_event_oneshot_timeout(*waitq, cx18_readl(cx, &mb->request) == cx18_readl(cx, &mb->ack), timeout); ... It didn't work. Sometimes it would wait the whole timeout, but the cx18_irq_handler() had gotten an ack interrupt. Then I tried: // FIXME break into several small timeouts/poll // or use an atomic to communicate completion CX18_DEBUG_HI_IRQ("sending interrupt SW1: %x to send %s\n", irq, info->name); ret = timeout; prepare_to_wait(waitq, &w, TASK_UNINTERRUPTIBLE); cx18_write_reg_expect(cx, irq, SW1_INT_SET, irq, irq); /* * Will we schedule in time, before the IRQ handler wakes up our waitq? * Who knows?! How exciting! Let the race begin! */ if (req != cx18_readl(cx, &mb->ack)) ret = schedule_timeout(timeout); finish_wait(waitq, &w); It didn't work, sometimes it would wait the whole timeout even though the ack interrupt had arrived. Again at the time, I was under the impression that schedule_timeout() would go to sleep anyway even if we had been awakened (thus my sarcastic comments). Did I miss anything with either of those two previous tries? I guess I need to dig into the guts of schedule_timeout() to convince myself that the process won't be put to sleep. I'm using Fedora 10 BTW: Linux palomino.walls.org 2.6.27.9-159.fc10.x86_64 #1 SMP Tue Dec 16 14:47:52 EST 2008 x86_64 x86_64 x86_64 GNU/Linux Thanks. Regards, Andy -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html