Re: [PATCH v8 2/3] CMDQ: Mediatek CMDQ driver

Horng-Shyang Liao <hs.liao@xxxxxxxxxxxx> · Tue, 21 Jun 2016 13:52:38 +0800

On Fri, 2016-06-17 at 17:57 +0200, Matthias Brugger wrote:
> 
> On 17/06/16 10:28, Horng-Shyang Liao wrote:
> > Hi Matthias,
> >
> > On Tue, 2016-06-14 at 20:07 +0800, Horng-Shyang Liao wrote:
> >> Hi Matthias,
> >>
> >> On Tue, 2016-06-14 at 12:17 +0200, Matthias Brugger wrote:
> >>>
> >>> On 14/06/16 09:44, Horng-Shyang Liao wrote:
> >>>> Hi Matthias,
> >>>>
> >>>> On Wed, 2016-06-08 at 17:35 +0200, Matthias Brugger wrote:
> >>>>>
> >>>>> On 08/06/16 14:25, Horng-Shyang Liao wrote:
> >>>>>> Hi Matthias,
> >>>>>>
> >>>>>> On Wed, 2016-06-08 at 12:45 +0200, Matthias Brugger wrote:
> >>>>>>>
> >>>>>>> On 08/06/16 07:40, Horng-Shyang Liao wrote:
> >>>>>>>> Hi Matthias,
> >>>>>>>>
> >>>>>>>> On Tue, 2016-06-07 at 18:59 +0200, Matthias Brugger wrote:
> >>>>>>>>>
> >>>>>>>>> On 03/06/16 15:11, Matthias Brugger wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>> [...]
> >>>>>>>>>
> >>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>> +            smp_mb(); /* modify jump before enable thread */
> >>>>>>>>>>>>>>>>>>> +        }
> >>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>> +        cmdq_thread_writel(thread, task->pa_base +
> >>>>>>>>>>>>>>>>>>> task->command_size,
> >>>>>>>>>>>>>>>>>>> +                   CMDQ_THR_END_ADDR);
> >>>>>>>>>>>>>>>>>>> +        cmdq_thread_resume(thread);
> >>>>>>>>>>>>>>>>>>> +    }
> >>>>>>>>>>>>>>>>>>> +    list_move_tail(&task->list_entry, &thread->task_busy_list);
> >>>>>>>>>>>>>>>>>>> +    spin_unlock_irqrestore(&cmdq->exec_lock, flags);
> >>>>>>>>>>>>>>>>>>> +}
> >>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>> +static void cmdq_handle_error_done(struct cmdq *cmdq,
> >>>>>>>>>>>>>>>>>>> +                   struct cmdq_thread *thread, u32 irq_flag)
> >>>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>>> +    struct cmdq_task *task, *tmp, *curr_task = NULL;
> >>>>>>>>>>>>>>>>>>> +    u32 curr_pa;
> >>>>>>>>>>>>>>>>>>> +    struct cmdq_cb_data cmdq_cb_data;
> >>>>>>>>>>>>>>>>>>> +    bool err;
> >>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>> +    if (irq_flag & CMDQ_THR_IRQ_ERROR)
> >>>>>>>>>>>>>>>>>>> +        err = true;
> >>>>>>>>>>>>>>>>>>> +    else if (irq_flag & CMDQ_THR_IRQ_DONE)
> >>>>>>>>>>>>>>>>>>> +        err = false;
> >>>>>>>>>>>>>>>>>>> +    else
> >>>>>>>>>>>>>>>>>>> +        return;
> >>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>> +    curr_pa = cmdq_thread_readl(thread, CMDQ_THR_CURR_ADDR);
> >>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>> +    list_for_each_entry_safe(task, tmp, &thread->task_busy_list,
> >>>>>>>>>>>>>>>>>>> +                 list_entry) {
> >>>>>>>>>>>>>>>>>>> +        if (curr_pa >= task->pa_base &&
> >>>>>>>>>>>>>>>>>>> +            curr_pa < (task->pa_base + task->command_size))
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> What are you checking here? It seems as if you make some implicit
> >>>>>>>>>>>>>>>>>> assumptions about pa_base and the order of execution of
> >>>>>>>>>>>>>>>>>> commands in the
> >>>>>>>>>>>>>>>>>> thread. Is it safe to do so? Does dma_alloc_coherent give any
> >>>>>>>>>>>>>>>>>> guarantees
> >>>>>>>>>>>>>>>>>> about dma_handle?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 1. Check what is the current running task in this GCE thread.
> >>>>>>>>>>>>>>>>> 2. Yes.
> >>>>>>>>>>>>>>>>> 3. Yes, CMDQ doesn't use iommu, so physical address is continuous.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Yes, physical addresses might be continuous, but AFAIK there is no
> >>>>>>>>>>>>>>>> guarantee that the dma_handle address is steadily growing, when
> >>>>>>>>>>>>>>>> calling
> >>>>>>>>>>>>>>>> dma_alloc_coherent. And if I understand the code correctly, you
> >>>>>>>>>>>>>>>> use this
> >>>>>>>>>>>>>>>> assumption to decide if the task picked from task_busy_list is
> >>>>>>>>>>>>>>>> currently
> >>>>>>>>>>>>>>>> executing. So I think this mechanism is not working.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't use dma_handle address, and just use physical addresses.
> >>>>>>>>>>>>>>>        From CPU's point of view, tasks are linked by the busy list.
> >>>>>>>>>>>>>>>        From GCE's point of view, tasks are linked by the JUMP command.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In which cases does the HW thread raise an interrupt.
> >>>>>>>>>>>>>>>> In case of error. When does CMDQ_THR_IRQ_DONE get raised?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> GCE will raise interrupt if any task is done or error.
> >>>>>>>>>>>>>>> However, GCE is fast, so CPU may get multiple done tasks
> >>>>>>>>>>>>>>> when it is running ISR.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In case of error, that GCE thread will pause and raise interrupt.
> >>>>>>>>>>>>>>> So, CPU may get multiple done tasks and one error task.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think we should reimplement the ISR mechanism. Can't we just read
> >>>>>>>>>>>>>> CURR_IRQ_STATUS and THR_IRQ_STATUS in the handler and leave
> >>>>>>>>>>>>>> cmdq_handle_error_done to the thread_fn? You will need to pass
> >>>>>>>>>>>>>> information from the handler to thread_fn, but that shouldn't be an
> >>>>>>>>>>>>>> issue. AFAIK interrupts are disabled in the handler, so we should stay
> >>>>>>>>>>>>>> there as short as possible. Traversing task_busy_list is expensive, so
> >>>>>>>>>>>>>> we need to do it in a thread context.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Actually, our initial implementation is similar to your suggestion,
> >>>>>>>>>>>>> but display needs CMDQ to return callback function very precisely,
> >>>>>>>>>>>>> else display will drop frame.
> >>>>>>>>>>>>> For display, CMDQ interrupt will be raised every 16 ~ 17 ms,
> >>>>>>>>>>>>> and CMDQ needs to call callback function in ISR.
> >>>>>>>>>>>>> If we defer callback to workqueue, the time interval may be larger than
> >>>>>>>>>>>>> 32 ms sometimes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think the problem is, that you implemented the workqueue as a ordered
> >>>>>>>>>>>> workqueue, so there is no parallel processing. I'm still not sure why
> >>>>>>>>>>>> you need the workqueue to be ordered. Can you please explain.
> >>>>>>>>>>>
> >>>>>>>>>>> The order should be kept.
> >>>>>>>>>>> Let me use mouse cursor as an example.
> >>>>>>>>>>> If task 1 means move mouse cursor to point A, task 2 means point B,
> >>>>>>>>>>> and task 3 means point C, our expected result is A -> B -> C.
> >>>>>>>>>>> If the order is not kept, the result could become A -> C -> B.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Got it, thanks for the clarification.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I think a way to get rid of the workqueue is to use a timer, which gets
> >>>>>>>>> programmed to the time a timeout in the first task in the busy list
> >>>>>>>>> would happen. Every time we update the busy list (e.g. because of task
> >>>>>>>>> got finished by the thread), we update the timer. When the timer
> >>>>>>>>> triggers, which hopefully won't happen too often, we return timeout on
> >>>>>>>>> the busy list elements, until the time is lower than the actual time.
> >>>>>>>>>
> >>>>>>>>> At least with this we can reduce the data structures in this driver and
> >>>>>>>>> make it more lightweight.
> >>>>>>>>
> >>>>>>>>     From my understanding, your proposed method can handle timeout case.
> >>>>>>>>
> >>>>>>>> However, the workqueue is also in charge of releasing tasks.
> >>>>>>>> Do you take releasing tasks into consideration by using the proposed
> >>>>>>>> timer method?
> >>>>>>>> Furthermore, I think the code will become more complex if we also use
> >>>>>>>> timer to implement releasing tasks.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Can't we call
> >>>>>>>             clk_disable_unprepare(cmdq->clock);
> >>>>>>>             cmdq_task_release(task);
> >>>>>>> after invoking the callback?
> >
> > After I put clk_disable_unprepare(cmdq->clock) into ISR, I encounter
> > another BUG.
> >
> > (Quoting some Linux 4.7 source code.)
> >
> >   void clk_unprepare(struct clk *clk)
> >   {
> >           if (IS_ERR_OR_NULL(clk))
> >                   return;
> >
> >           clk_prepare_lock();                      // <-- Here
> >           clk_core_unprepare(clk->core);
> >           clk_prepare_unlock();
> >   }
> >   EXPORT_SYMBOL_GPL(clk_unprepare);
> >
> >   static void clk_prepare_lock(void)
> >   {
> >           if (!mutex_trylock(&prepare_lock)) {     // <-- Here
> >                   if (prepare_owner == current) {
> >                           prepare_refcnt++;
> >                           return;
> >                   }
> >                   mutex_lock(&prepare_lock);
> >           }
> >           WARN_ON_ONCE(prepare_owner != NULL);
> >           WARN_ON_ONCE(prepare_refcnt != 0);
> >           prepare_owner = current;
> >           prepare_refcnt = 1;
> >   }
> >
> > So, 'unprepare' can sleep and cannot be called from an ISR.
> > I also tried putting it into a timer, but the error is the same,
> > since timer callbacks are executed in softirq context.
> >
> > We need clk_disable_unprepare() since it can save power consumption
> > in idle.
> 
> We can call clk_prepare in probe and then use clk_enable/clk_disable, 
> which don't sleep.
> 
> Regards,
> Matthias

Hi Matthias,

The clock gate and MUX are controlled by clk_enable/clk_disable,
but the PLL is controlled by clk_prepare/clk_unprepare,
so I still need to call clk_unprepare.

Once releasing the buffer, releasing the task, and handling the timeout
are all moved out of the work, the work no longer has to be tied to a task.

Therefore, I can use the following flow to reduce the number of work items.

if task_busy_list goes from empty to non-empty
	call clk_prepare_enable
if task_busy_list goes from non-empty to empty
	in ISR, queue a work that calls clk_disable_unprepare
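
To make the flow concrete, here is a minimal userspace sketch in C.
It is not the driver code. struct cmdq_sim, the sim_* functions, and
the clk stubs are hypothetical names used only for illustration. I
assume a single busy_tasks counter can stand in for the task_busy_list
of all threads, and a pending_works counter can stand in for the
workqueue.

```c
#include <assert.h>

/* Hypothetical stand-ins for the clk API: they only count references. */
static int clk_refcnt;

static void clk_prepare_enable_stub(void)
{
	clk_refcnt++;
}

static void clk_disable_unprepare_stub(void)
{
	clk_refcnt--;
}

/* Assumption: one counter models the busy lists of all GCE threads. */
struct cmdq_sim {
	int busy_tasks;    /* number of tasks on the busy list */
	int pending_works; /* queued clk_disable_unprepare works */
};

/* Process context: sleeping is allowed, so prepare and enable directly. */
static void sim_task_submit(struct cmdq_sim *c)
{
	if (c->busy_tasks == 0) /* empty -> non-empty */
		clk_prepare_enable_stub();
	c->busy_tasks++;
}

/* ISR context: sleeping is not allowed, so only queue the work. */
static void sim_isr_task_done(struct cmdq_sim *c)
{
	assert(c->busy_tasks > 0);
	c->busy_tasks--;
	if (c->busy_tasks == 0) /* non-empty -> empty */
		c->pending_works++;
}

/* Workqueue context: sleeping is allowed, so unprepare is safe here. */
static void sim_run_works(struct cmdq_sim *c)
{
	while (c->pending_works > 0) {
		clk_disable_unprepare_stub();
		c->pending_works--;
	}
}
```

If a new task is submitted after the ISR has queued the work but
before the work runs, the count goes 1 -> 2 -> 1, so the clock stays
on while a task is busy. This relies on clk_prepare_enable and
clk_disable_unprepare being reference counted.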

What do you think of this solution?

Thanks,
HS

> > Therefore, I plan to
> > (1) move releasing buffer and task into ISR,
> > (2) move timeout into timer, and
> > (3) keep workqueue for clk_disable_unprepare().
> >
> > What do you think?
> >
> > Thanks,
> > HS
> >
> >>>>>>
> >>>>>> Do you mean just call these two functions in ISR?
> >>>>>> My major concern is dma_free_coherent() and kfree() in
> >>>>>> cmdq_task_release(task).
> >>>>>
> >>>>> Why do we need the dma calls at all? Can't we just calculate the
> >>>>> physical address using __pa(x)?
> >>>>
> >>>> I prefer to use dma_map_single/dma_unmap_single.
> >>>>
> >>>
> >>> Can you please elaborate why you need this. We don't do dma, so we
> >>> should not use dma memory for this.
> >>
> >> We need a buffer to share between CPU and GCE, so we do need DMA.
> >> CPU is in charge of writing GCE commands into this buffer.
> >> GCE is in charge of reading and running GCE commands from this buffer.
> >> When we chain CMDQ tasks, we also need to modify GCE JUMP command.
> >> Therefore, I prefer to use dma_alloc_coherent and dma_free_coherent.
> >>
> >> However, if we want to use timer to handle timeout, we need to release
> >> memory in ISR.
> >> In this case, using kmalloc/kfree + dma_map_single/dma_unmap_single
> >> instead of dma_alloc_coherent/dma_free_coherent is an alternative
> >> solution, but taking care the synchronization between cache and memory
> >> is the expected overhead.
> >>
> >>>>>> Therefore, your suggestion is to use GFP_ATOMIC for both
> >>>>>> dma_alloc_coherent() and kzalloc(). Right?
> >>>>>
> >>>>> I don't think we need GFP_ATOMIC, the critical path will just free the
> >>>>> memory.
> >>>>
> >>>> I tested these two functions, and kfree was safe.
> >>>> However, dma_free_coherent raised BUG.
> >>>> BUG: failure at
> >>>> /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:1514/vunmap()!
> >>>
> >>> Just a general hint. Please try to evaluate on a recent kernel. It looks
> >>> like as if you tried this on a v3.18 based one.
> >>
> >> This driver should be backward compatible to v3.18 for a MTK project.
> >>
> >>> Best regards,
> >>> Matthias
> >>
> >> Thanks,
> >> HS
> >>
> >>>> 1512 void vunmap(const void *addr)
> >>>> 1513 {
> >>>> 1514         BUG_ON(in_interrupt());		// <-- here
> >>>> 1515         might_sleep();
> >>>> 1516         if (addr)
> >>>> 1517                 __vunmap(addr, 0);
> >>>> 1518 }
> >>>> 1519 EXPORT_SYMBOL(vunmap);
> >>>>
> >>>> Therefore, I plan to use kmalloc + dma_map_single instead of
> >>>> dma_alloc_coherent, and dma_unmap_single + kfree instead of
> >>>> dma_free_coherent.
> >>>>
> >>>> What do you think about the function replacement?
> >>>>
> >>>>>> If so, I can try to implement timeout by timer, and discuss with you
> >>>>>> if I have further questions.
> >>>>>>
> >>>>>
> >>>>> Sounds good :)
> >>>>>
> >>>>> Thanks,
> >>>>> Matthias
> >>>>
> >>>> Thanks,
> >>>> HS
> >>>>
> >>>>>>> Regarding the clock, wouldn't it be easier to handle the clock
> >>>>>>> enable/disable depending on the state of task_busy_list? I suppose we
> >>>>>>> can't as we would need to check the task_busy_list of all threads, right?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Matthias
> >>>>>>
> >>>>>> Thanks,
> >>>>>> HS
