On Mon, 2009-03-30 at 13:54 -0700, Trent Piepho wrote: > On Sun, 29 Mar 2009, Andy Walls wrote: > > On Sun, 2009-03-29 at 01:24 -0700, Trent Piepho wrote: > > > wait_event() should take care of this. wait_event(q, test) basically does: > > > > > > for(;;) { > > > // point A > > > add_me_to_waitqueue(q); > > > set_current_state(TASK_UNINTERRUPTIBLE); > > > if (test) > > > break; > > > // point B > > > schedule(); > > > } > > > clean_up_wait_stuff(); > > > > As you know, the condition is checked even before this loop is entered, > > to avoid even being even added to a waitqueue. (Thank God for ctags...) > > I think the initial check of the condition is just an optimization and > everything will still work without it. Seeing as all this is inlined, I > wonder if it's a good optimization... I guess it depends on how expensive prepare_to_wait() is since it acquires a spinlock. But now that you mention inlining. I now have to check if this sequence that I thought didn't work: prepare_to_wait(waitq, &w, TASK_UNINTERRUPTIBLE); cx18_write_reg_expect(cx, irq, SW1_INT_SET, irq, irq); if (req != cx18_readl(cx, &mb->ack)) ret = schedule_timeout(timeout); finish_wait(waitq, &w); is actually getting reordered. I bet constituent parts of the first two lines may be. That would certainly explain things. > > As you may have noticed, the original code was using > > wait_event_timeout() before like this: > > > > CX18_DEBUG_HI_IRQ("sending interrupt SW1: %x to send %s\n", > > irq, info->name); > > cx18_write_reg_expect(cx, irq, SW1_INT_SET, irq, irq); > > > > ret = wait_event_timeout( > > *waitq, > > cx18_readl(cx, &mb->ack) == cx18_readl(cx, &mb->request), > > timeout); > > > > Because waiting for the ack back is the right thing to do, but certainly > > waiting too long is not warranted. > > > > This gave me the occasional log message like this: > > > > 1: cx18-0: irq: sending interrupt SW1: 8 to send CX18_CPU_DE_SET_MDL > > 2: cx18-0: irq: received interrupts SW1: 0 SW2: 8 HW2: 0 > > 3: cx18-0: irq: received interrupts SW1: 10000 SW2: 0 HW2: 0 > > 4: cx18-0: warning: sending CX18_CPU_DE_SET_MDL timed out waiting 10 msecs for RPU acknowledgement > > > > Where line 1 is the driver notifiying the firmware with a SW1 interrupt. > > Line 2 is the firmware responding back to the cx18_irq_handler() with > > the Ack interrupt in SW2 (the flags match, 8 & 8, by design). > > Line 3 is an unrelated incoming video buffer notification for the cx18 > > driver. > > Line 4 is the wait_event_timeout() timing out. > > Could it be that the wait_event doesn't actually run and check its > condition until _after_ line 3? SW2: 8 is a firmware response to us setting the outgoing SW1 to 8. SW2 getting cleared is done by cx18_irq_handler and its helper: static void xpu_ack(struct cx18 *cx, u32 sw2) { // Wake up the process waiting on the EPU -> CPU mailbox ack // This one has a flag value of 8 if (sw2 & IRQ_CPU_TO_EPU_ACK) wake_up(&cx->mb_cpu_waitq); // Wake up the process waiting on the EPU -> APU mailbox ack if (sw2 & IRQ_APU_TO_EPU_ACK) wake_up(&cx->mb_apu_waitq); } irqreturn_t cx18_irq_handler(int irq, void *dev_id) { struct cx18 *cx = (struct cx18 *)dev_id; u32 sw1, sw2, hw2; // Get the status of SW2 sw2 = cx18_read_reg(cx, SW2_INT_STATUS) & cx->sw2_irq_mask; // Clear any status of SW2 that were found set if (sw2) cx18_write_reg_expect(cx, sw2, SW2_INT_STATUS, ~sw2, sw2); // Act on any SW2 interrupts found set if (sw2) xpu_ack(cx, sw2); return (sw1 || sw2 || hw2) ? IRQ_HANDLED : IRQ_NONE; } > In that case SW2 != 8 and so it goes back > to sleep? If SW2 got set back to 0 (or != 8), that means we cleared it ourselves. We only do that slightly prior to waking up the proper waitq. > Calling wake_up() just makes the processes on the waitq > runnable, they don't actually run until later, possibly much later. Hmm. Maybe then we're yielding the processor and then simply running much later. If that's the case, I may be stuck with shorter timeouts with polls. :( > > > If your event occurs and wake_up() is called at point A, then the test > > > should be true when it's checked and schedule() is never called. If the > > > event happens at point B, then the process' state will have been changed to > > > TASK_RUNNING by wake_up(), remember it's already on the waitqueue at this > > > point, and schedule() won't sleep. > > > > OK, for some reason, I thought schedule() and schedule_timeout() would > > go to sleep anyway. > > AFAIK, they'll still cause the kernel schedule a process. Maybe a > different process. But the original process is still in TASK_RUNNING state > and so still in the run queue and will get run again. If it was in > TASK_(UN)INTERRUPTIBLE state then it wouldn't be in the run queue and > wouldn't run again until something woke it up. I'm going to have to figure out how to profile things. > > > I think what's probably happening is the test, cx18_readl(cx, &mb->ack) == > > > cx18_readl(cx, &mb->request), is somehow not true even though the ack has > > > been received. > > > > A PCI bus read error could be the culprit here. That's the only thing I > > can think of. We only get one notification via IRQ from the firmware. > > > > > > > Maybe a new request was added? > > > > No, I lock the respective epu2apu or epu2cpu mailboxes respectively with > > a mutex. > > But in your log: > > 1: cx18-0: irq: sending interrupt SW1: 8 to send CX18_CPU_DE_SET_MDL > > 2: cx18-0: irq: received interrupts SW1: 0 SW2: 8 HW2: 0 > > 3: cx18-0: irq: received interrupts SW1: 10000 SW2: 0 HW2: 0 > > 4: cx18-0: warning: sending CX18_CPU_DE_SET_MDL timed out waiting 10 msecs for RPU acknowledgement > > Isn't the wait_event_timeout() waiting until line 4? And doesn't line 3 > mean something has changed the registers? Changed them before the > wait_event finished? First a note: There are four mailboxes the driver cares about EPU -> CPU, CPU -> EPU, EPU -> APU, and APU -> EPU. We mostly care about the EPU/CPU mailboxes. Line 1 indicates an outgoing command in the EPU -> CPU mailbox Line 2 indicates an ack was placed in the EPU -> CPU mailbox Line 3 indicates an incoming command in the CPU -> EPU mailbox Line 4 indicates the async notification in Line 2 being missed by us somehow So yes, wait_event_timeout() is waiting until line 4. yes, line 3 means we cleared the SW2 register when we got line 2, yes, line 3 also means the firmware is sending us a full capture buffer with information in the CPU -> EPU mailbox (not the mbox at issue) But, line 3 does not mean the EPU -> CPU mailbox ack changed, Line 2 tells us that should be the case (different mailboxes) The process of the IRQ handler clearing SW2 should have resulted in the IRQ handler also sending a wake_up(). It doesn't appear to happen a small percentage of the time. > > > I think calling wait_event()'s with something that tests a hardware > > > register is a little iffy. It's better if the irq handler sets some driver > > > state flag (atomically!) that indicates the event you were waiting for has > > > happened and then you check that flag. > > > > I was toying with setting an atomic while in the IRQ handler. But then > > I realized when we get the ack interrupt, the firmware should actually > > be done. So really the wakeup() is the only indicator I really need. > > Checking for ack == req is just a formality I guess. > > If you use an interruptible timeout, then you could get interrupted with a > signal before the irq handler has woken you. I used UNITERRUPTIBLE here because it doesn't quite make sense to inidcate -ERESTARTSYS back to the caller in all circumstances. This function can be called by read() to send a series of commands to start a capture or to just send back a drained buffer; poll() to send a series of commands to start a capture; or by the work handler to return a drained buffer. > > There wasn't a wait_timeout(), so I had tried something like this in my > > first iteration: > > It's called sleep_on_timeout(q, timeout). I was deterred by the big scary comment: /* * These are the old interfaces to sleep waiting for an event. * They are racy. DO NOT use them, use the wait_event* interfaces above. * We plan to remove these interfaces. */ extern void sleep_on(wait_queue_head_t *q); extern long sleep_on_timeout(wait_queue_head_t *q, signed long timeout); Again, thanks for helping me think this through. It's obvious I have more testing and troubleshooting to do. My first step will be to check to see if compiler reordering screwed up one of my prior patch attempts. Regards, Andy -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html