Re: Discuss the multi-core media scheduler

Nicolas Dufresne <nicolas@xxxxxxxxxxxx> · Tue, 30 Apr 2024 12:46:31 -0400

Hi Daniel,

On Sunday, 28 April 2024 at 15:26 -0300, Daniel Almeida wrote:
> Hi everyone,
> 
> There seem to be a few unsolved problems in the mem2mem framework, one of
> which is the lack of support for architectures with multiple heterogeneous
> cores. For example, it is currently impossible to describe Mediatek's LAT and
> CORE cores to the framework as two independent units to be scheduled. This means
> that, at all times, one unit is idle while the other one is working.
> 
> I know that this is not the only problem with m2m, but it is where I'd like to
> start the discussion. Feel free to add your own requirements to the thread.
> 
> My proposed solution is to add a new iteration of mem2mem, which I have named
> the Multi-core Media Scheduler for the lack of a better term.
> 
> Please note that I will use the terms input/output queues in place of
> output/capture for the sake of readability.

One use case that isn't covered here, and that we really need in order to move
forward on RPi4/5, is cores that can execute multiple tasks at once.

In the case of the Argon HEVC decoder on the Pi, the entropy decoder and the
reconstruction run in parallel, but the two functions share the same
trigger/IRQ pair.

In short, we need to be able (if there is enough data in the vb2 queue) to
schedule two consecutive jobs at once. On a timeline:

----------------------------------------------------->
[entropy0][no decoder]
                      [entropy1][decode0]
                                         [entropy2][decode1]

Perhaps it already fits in the RFC, but it wasn't expressed clearly as a use
case. For real-time reasons, it's not really the driver's responsibility to wait
for buffers to be queued, and a no-op can happen in either of the two functions.
Also, I believe you can mix entropy decoding from one stream with decoding a
frame from another stream (another video session / m2m ctx).
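
To make the timeline above concrete, here is a rough sketch (plain Python, not
kernel code) of which job each function would work on per trigger. The name
two_stage_slots is made up for illustration, and it assumes a single stream
with every buffer already queued; None marks a no-op in that function.

```python
from typing import List, Optional, Tuple


def two_stage_slots(n_jobs: int) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return one (entropy_job, reconstruction_job) pair per hardware trigger.

    Hypothetical helper: reconstruction of job n runs in the same trigger
    as entropy decoding of job n + 1, as in the timeline above.
    """
    if n_jobs <= 0:
        return []
    slots = []
    # One extra trigger is needed to reconstruct the last job.
    for t in range(n_jobs + 1):
        entropy = t if t < n_jobs else None   # no-op once input is drained
        recon = t - 1 if t >= 1 else None     # no-op on the very first trigger
        slots.append((entropy, recon))
    return slots
```

With three jobs this gives four triggers, the first and last each containing
one no-op, which matches the three rows drawn above plus the final
reconstruction.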

Nicolas

> 
> -------------------------------------------------------------------------------
> 
> The basic idea is to have a core as the basic entity to be scheduled, with its
> own input and output VB2 queues. This default will be identical to what we have
> today in m2m.
> 
>  input        output
> <----- core ----->
> 
> In all cases, this will be the only interface that the framework will expose to
> the outside world. The complexity to handle multiple cores will be hidden from
> callers. This will also allow us to keep the implementation compatible with
> the current mem2mem interfaces, which expose only two queues.
> 
> To support multiple cores, each core can connect to another core to establish a
> data dependency, in which case, they will communicate through a new type of
> queue, here described as "shared".
> 
>  input           shared         output
> <----- core0 -------> core1 ------>
> 
> This arrangement is basically an extension of the mem2mem idea, like so:
> 
> mem2mem2mem2mem
> 
> ...with as many links as there are cores.
> 
> The key idea is that now, cores can be scheduled independently through a call
> to schedule(core_number, work) to indicate that they should start processing
> the work. They can also be marked as idle independently through a
> job_done(core_number) call.
> 
> It will be the driver's responsibility to describe the pipeline to the
> framework, indicating how cores are connected. The driver will also have to
> implement the logic for schedule() and job_done() for a given core.
> 
> Queuing buffers into the framework's input queue will push the work into the
> pipeline. Whenever a job is done, the framework will push the job into the
> queue that is shared with the downstream core and attempt to schedule it. It
> will also attempt to pull a workitem from the upstream queue.
> 
> When the job is processed by the last core in the pipeline, it will be marked
> as done and pushed into the framework's output queue.
> 
> At all times, a buffer should have an owner, and the framework will ensure that
> cores cannot touch buffers belonging to other cores.
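
The flow described above could be modelled roughly as follows. This is only a
toy sketch in Python, not a proposed kernel API: the Pipeline class, its
attribute names and the "framework"/"coreN" owner labels are all assumptions
made for illustration. Only the job_done() name comes from the RFC text, and
schedule(core_number, work) is represented by the private _try_schedule().

```python
from collections import deque


class Pipeline:
    """Toy model of a linear chain of cores linked by shared queues."""

    def __init__(self, n_cores):
        # queues[0] is the input queue, queues[n_cores] the output queue,
        # and queues[i] in between is shared by core i-1 and core i.
        self.queues = [deque() for _ in range(n_cores + 1)]
        self.running = [None] * n_cores   # buffer each core is working on
        self.owner = {}                   # buffer -> current owner label

    def queue_input(self, buf):
        self.queues[0].append(buf)
        self.owner[buf] = "framework"
        self._try_schedule(0)

    def _try_schedule(self, core):
        # Start a job only if the core is idle and its upstream queue has work.
        if self.running[core] is None and self.queues[core]:
            buf = self.queues[core].popleft()
            self.running[core] = buf
            self.owner[buf] = f"core{core}"
            # A real driver would program the hardware at this point.

    def job_done(self, core):
        buf = self.running[core]
        if buf is None:
            raise RuntimeError(f"core{core} has no job in flight")
        self.running[core] = None
        self.owner[buf] = "framework"
        # Push downstream (or to the output queue for the last core).
        self.queues[core + 1].append(buf)
        if core + 1 < len(self.running):
            self._try_schedule(core + 1)
        # Then pull the next work item from upstream.
        self._try_schedule(core)
```

With two cores and two queued buffers, the first job_done(0) leaves both cores
busy at once, which is the behaviour the current m2m framework cannot express.
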
> 
> This workflow can be expanded to account for a group of identical cores, here
> denoted as "clusters". In such a case, each core will have its own input and
> output queues:
> 
>  input      output           input      output      output 
> <---- core0 ----->          <---- core1 ---->     ------->
>                                     <---- core2 ---->
>                                     input      output
> 
> Ideally, the framework will dispatch work from the output queue with the most
> items to the input queue with the fewest items to balance the load. This way,
> clusters and cores can compose to describe complex architectures.
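
As a rough illustration of that balancing rule (a sketch only, in Python;
dispatch_one is a made-up name, and breaking ties by lowest index is an
arbitrary assumption):

```python
from collections import deque


def dispatch_one(outputs, inputs):
    """Move one item from the fullest output queue to the emptiest input queue.

    Returns (source_index, destination_index), or None if nothing can move.
    Both arguments are lists of collections.deque objects.
    """
    if not outputs or not inputs:
        return None
    # Fullest upstream output queue; ties go to the lowest index.
    src = max(range(len(outputs)), key=lambda i: len(outputs[i]))
    if not outputs[src]:
        return None  # every output queue is empty
    # Emptiest downstream input queue; ties go to the lowest index.
    dst = min(range(len(inputs)), key=lambda i: len(inputs[i]))
    inputs[dst].append(outputs[src].popleft())
    return src, dst
```

Calling this once per completed job keeps the input queues of a cluster within
one item of each other as long as the cores drain at similar rates.
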
> 
> Of course, this is a rough sketch, and there are lots of unexplained minutiae to
> sort out, but I hope that the general idea is enough to get a discussion going.
> 
> -- Daniel
>