Re: [PATCH v8 2/3] CMDQ: Mediatek CMDQ driver

Horng-Shyang Liao <hs.liao@xxxxxxxxxxxx> · Tue, 14 Jun 2016 20:07:43 +0800

Hi Matthias,

On Tue, 2016-06-14 at 12:17 +0200, Matthias Brugger wrote:
> 
> On 14/06/16 09:44, Horng-Shyang Liao wrote:
> > Hi Matthias,
> >
> > On Wed, 2016-06-08 at 17:35 +0200, Matthias Brugger wrote:
> >>
> >> On 08/06/16 14:25, Horng-Shyang Liao wrote:
> >>> Hi Matthias,
> >>>
> >>> On Wed, 2016-06-08 at 12:45 +0200, Matthias Brugger wrote:
> >>>>
> >>>> On 08/06/16 07:40, Horng-Shyang Liao wrote:
> >>>>> Hi Matthias,
> >>>>>
> >>>>> On Tue, 2016-06-07 at 18:59 +0200, Matthias Brugger wrote:
> >>>>>>
> >>>>>> On 03/06/16 15:11, Matthias Brugger wrote:
> >>>>>>>
> >>>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +            smp_mb(); /* modify jump before enable thread */
> >>>>>>>>>>>>>>>> +        }
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +        cmdq_thread_writel(thread, task->pa_base +
> >>>>>>>>>>>>>>>> task->command_size,
> >>>>>>>>>>>>>>>> +                   CMDQ_THR_END_ADDR);
> >>>>>>>>>>>>>>>> +        cmdq_thread_resume(thread);
> >>>>>>>>>>>>>>>> +    }
> >>>>>>>>>>>>>>>> +    list_move_tail(&task->list_entry, &thread->task_busy_list);
> >>>>>>>>>>>>>>>> +    spin_unlock_irqrestore(&cmdq->exec_lock, flags);
> >>>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +static void cmdq_handle_error_done(struct cmdq *cmdq,
> >>>>>>>>>>>>>>>> +                   struct cmdq_thread *thread, u32 irq_flag)
> >>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>> +    struct cmdq_task *task, *tmp, *curr_task = NULL;
> >>>>>>>>>>>>>>>> +    u32 curr_pa;
> >>>>>>>>>>>>>>>> +    struct cmdq_cb_data cmdq_cb_data;
> >>>>>>>>>>>>>>>> +    bool err;
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +    if (irq_flag & CMDQ_THR_IRQ_ERROR)
> >>>>>>>>>>>>>>>> +        err = true;
> >>>>>>>>>>>>>>>> +    else if (irq_flag & CMDQ_THR_IRQ_DONE)
> >>>>>>>>>>>>>>>> +        err = false;
> >>>>>>>>>>>>>>>> +    else
> >>>>>>>>>>>>>>>> +        return;
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +    curr_pa = cmdq_thread_readl(thread, CMDQ_THR_CURR_ADDR);
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +    list_for_each_entry_safe(task, tmp, &thread->task_busy_list,
> >>>>>>>>>>>>>>>> +                 list_entry) {
> >>>>>>>>>>>>>>>> +        if (curr_pa >= task->pa_base &&
> >>>>>>>>>>>>>>>> +            curr_pa < (task->pa_base + task->command_size))
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What are you checking here? It seems as if you make some implcit
> >>>>>>>>>>>>>>> assumptions about pa_base and the order of execution of
> >>>>>>>>>>>>>>> commands in the
> >>>>>>>>>>>>>>> thread. Is it save to do so? Does dma_alloc_coherent give any
> >>>>>>>>>>>>>>> guarantees
> >>>>>>>>>>>>>>> about dma_handle?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Check what is the current running task in this GCE thread.
> >>>>>>>>>>>>>> 2. Yes.
> >>>>>>>>>>>>>> 3. Yes, CMDQ doesn't use iommu, so physical address is continuous.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes, physical addresses might be continous, but AFAIK there is no
> >>>>>>>>>>>>> guarantee that the dma_handle address is steadily growing, when
> >>>>>>>>>>>>> calling
> >>>>>>>>>>>>> dma_alloc_coherent. And if I understand the code correctly, you
> >>>>>>>>>>>>> use this
> >>>>>>>>>>>>> assumption to decide if the task picked from task_busy_list is
> >>>>>>>>>>>>> currently
> >>>>>>>>>>>>> executing. So I think this mecanism is not working.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't use dma_handle address, and just use physical addresses.
> >>>>>>>>>>>>       From CPU's point of view, tasks are linked by the busy list.
> >>>>>>>>>>>>       From GCE's point of view, tasks are linked by the JUMP command.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> In which cases does the HW thread raise an interrupt.
> >>>>>>>>>>>>> In case of error. When does CMDQ_THR_IRQ_DONE get raised?
> >>>>>>>>>>>>
> >>>>>>>>>>>> GCE will raise interrupt if any task is done or error.
> >>>>>>>>>>>> However, GCE is fast, so CPU may get multiple done tasks
> >>>>>>>>>>>> when it is running ISR.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In case of error, that GCE thread will pause and raise interrupt.
> >>>>>>>>>>>> So, CPU may get multiple done tasks and one error task.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I think we should reimplement the ISR mechanism. Can't we just read
> >>>>>>>>>>> CURR_IRQ_STATUS and THR_IRQ_STATUS in the handler and leave
> >>>>>>>>>>> cmdq_handle_error_done to the thread_fn? You will need to pass
> >>>>>>>>>>> information from the handler to thread_fn, but that shouldn't be an
> >>>>>>>>>>> issue. AFAIK interrupts are disabled in the handler, so we should stay
> >>>>>>>>>>> there as short as possible. Traversing task_busy_list is expensive, so
> >>>>>>>>>>> we need to do it in a thread context.
> >>>>>>>>>>
> >>>>>>>>>> Actually, our initial implementation is similar to your suggestion,
> >>>>>>>>>> but display needs CMDQ to return callback function very precisely,
> >>>>>>>>>> else display will drop frame.
> >>>>>>>>>> For display, CMDQ interrupt will be raised every 16 ~ 17 ms,
> >>>>>>>>>> and CMDQ needs to call callback function in ISR.
> >>>>>>>>>> If we defer callback to workqueue, the time interval may be larger than
> >>>>>>>>>> 32 ms.sometimes.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I think the problem is, that you implemented the workqueue as a ordered
> >>>>>>>>> workqueue, so there is no parallel processing. I'm still not sure why
> >>>>>>>>> you need the workqueue to be ordered. Can you please explain.
> >>>>>>>>
> >>>>>>>> The order should be kept.
> >>>>>>>> Let me use mouse cursor as an example.
> >>>>>>>> If task 1 means move mouse cursor to point A, task 2 means point B,
> >>>>>>>> and task 3 means point C, our expected result is A -> B -> C.
> >>>>>>>> If the order is not kept, the result could become A -> C -> B.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Got it, thanks for the clarification.
> >>>>>>>
> >>>>>>
> >>>>>> I think a way to get rid of the workqueue is to use a timer, which gets
> >>>>>> programmed to the time a timeout in the first task in the busy list
> >>>>>> would happen. Everytime we update the busy list (e.g. because of task
> >>>>>> got finished by the thread), we update the timer. When the timer
> >>>>>> triggers, which hopefully won't happen too often, we return timeout on
> >>>>>> the busy list elements, until the time is lower then the actual time.
> >>>>>>
> >>>>>> At least with this we can reduce the data structures in this driver and
> >>>>>> make it more lightweight.
> >>>>>
> >>>>>    From my understanding, your proposed method can handle timeout case.
> >>>>>
> >>>>> However, the workqueue is also in charge of releasing tasks.
> >>>>> Do you take releasing tasks into consideration by using the proposed
> >>>>> timer method?
> >>>>> Furthermore, I think the code will become more complex if we also use
> >>>>> timer to implement releasing tasks.
> >>>>>
> >>>>
> >>>> Can't we call
> >>>>            clk_disable_unprepare(cmdq->clock);
> >>>>            cmdq_task_release(task);
> >>>> after invoking the callback?
> >>>
> >>> Do you mean just call these two functions in ISR?
> >>> My major concern is dma_free_coherent() and kfree() in
> >>> cmdq_task_release(task).
> >>
> >> Why do we need the dma calls at all? Can't we just calculate the
> >> physical address using __pa(x)?
> >
> > I prefer to use dma_map_single/dma_unmap_single.
> >
> 
> Can you please elaborate why you need this. We don't do dma, so we 
> should not use dma memory for this.

We need a buffer to share between CPU and GCE, so we do need DMA.
CPU is in charge of writing GCE commands into this buffer.
GCE is in charge of reading and running GCE commands from this buffer.
When we chain CMDQ tasks, we also need to modify GCE JUMP command.
Therefore, I prefer to use dma_alloc_coherent and dma_free_coherent.

However, if we want to use timer to handle timeout, we need to release
memory in ISR.
In this case, using kmalloc/kfree + dma_map_single/dma_unmap_single
instead of dma_alloc_coherent/dma_free_coherent is an alternative
solution, but taking care the synchronization between cache and memory
is the expected overhead.

> >>> Therefore, your suggestion is to use GFP_ATOMIC for both
> >>> dma_alloc_coherent() and kzalloc(). Right?
> >>
> >> I don't think we need GFP_ATOMIC, the critical path will just free the
> >> memory.
> >
> > I tested these two functions, and kfree was safe.
> > However, dma_free_coherent raised BUG.
> > BUG: failure at
> > /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:1514/vunmap()!
> 
> Just a general hint. Please try to evaluate on a recent kernel. It looks 
> like as if you tried this on a v3.18 based one.

This driver should be backward compatible to v3.18 for a MTK project.

> Best regards,
> Matthias

Thanks,
HS

> > 1512 void vunmap(const void *addr)
> > 1513 {
> > 1514         BUG_ON(in_interrupt());		// <-- here
> > 1515         might_sleep();
> > 1516         if (addr)
> > 1517                 __vunmap(addr, 0);
> > 1518 }
> > 1519 EXPORT_SYMBOL(vunmap);
> >
> > Therefore, I plan to use kmalloc + dma_map_single instead of
> > dma_alloc_coherent, and dma_unmap_single + kfree instead of
> > dma_free_coherent.
> >
> > What do you think about the function replacement?
> >
> >>> If so, I can try to implement timeout by timer, and discuss with you
> >>> if I have further questions.
> >>>
> >>
> >> Sounds good :)
> >>
> >> Thanks,
> >> Matthias
> >
> > Thanks,
> > HS
> >
> >>>> Regrading the clock, wouldn't it be easier to handle the clock
> >>>> enable/disable depending on the state of task_busy_list? I suppose we
> >>>> can't as we would need to check the task_busy_list of all threads, right?
> >>>>
> >>>> Regards,
> >>>> Matthias
> >>>
> >>> Thanks,
> >>> HS
> >>>
> >
> >

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html