On 22/01/2024 16:30, Boris Brezillon wrote:
> This is the piece of software interacting with the FW scheduler, and
> taking care of some scheduling aspects when the FW comes short of
> scheduling slots. Indeed, the FW only expose a few slots, and the kernel
> has to give all submission contexts, a chance to execute their jobs.
>
> The kernel-side scheduler is timeslice-based, with a round-robin queue
> per priority level.
>
> Job submission is handled with a 1:1 drm_sched_entity:drm_gpu_scheduler,
> allowing us to delegate the dependency tracking to the core.
>
> All the gory details should be documented inline.
>
> v4:
> - Check drmm_mutex_init() return code
> - s/drm_gem_vmap_unlocked/drm_gem_vunmap_unlocked/ in
>   panthor_queue_put_syncwait_obj()
> - Drop unneeded WARN_ON() in cs_slot_sync_queue_state_locked()
> - Use atomic_xchg() instead of atomic_fetch_and(0)
> - Fix typos
> - Let panthor_kernel_bo_destroy() check for IS_ERR_OR_NULL() BOs
> - Defer TILER_OOM event handling to a separate workqueue to prevent
>   deadlocks when the heap chunk allocation is blocked on mem-reclaim.
>   This is just a temporary solution, until we add support for
>   non-blocking/failable allocations
> - Pass the scheduler workqueue to drm_sched instead of instantiating
>   a separate one (no longer needed now that heap chunk allocation
>   happens on a dedicated wq)
> - Set WQ_MEM_RECLAIM on the scheduler workqueue, so we can handle
>   job timeouts when the system is under mem pressure, and hopefully
>   free up some memory retained by these jobs
>
> v3:
> - Rework the FW event handling logic to avoid races
> - Make sure MMU faults kill the group immediately
> - Use the panthor_kernel_bo abstraction for group/queue buffers
> - Make in_progress an atomic_t, so we can check it without the reset lock
>   held
> - Don't limit the number of groups per context to the FW scheduler
>   capacity. Fix the limit to 128 for now.
> - Add a panthor_job_vm() helper
> - Account for panthor_vm changes
> - Add our job fence as DMA_RESV_USAGE_WRITE to all external objects
>   (was previously DMA_RESV_USAGE_BOOKKEEP). I don't get why, given
>   we're supposed to be fully-explicit, but other drivers do that, so
>   there must be a good reason
> - Account for drm_sched changes
> - Provide a panthor_queue_put_syncwait_obj()
> - Unconditionally return groups to their idle list in
>   panthor_sched_suspend()
> - Condition of sched_queue_{,delayed_}work fixed to be only when a reset
>   isn't pending or in progress.
> - Several typos in comments fixed.
>
> Co-developed-by: Steven Price <steven.price@xxxxxxx>
> Signed-off-by: Steven Price <steven.price@xxxxxxx>
> Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>

Looks good, a few minor NITs below, but otherwise:

Reviewed-by: Steven Price <steven.price@xxxxxxx>

> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 3500 +++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_sched.h |   48 +
>  2 files changed, 3548 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
>
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> new file mode 100644
> index 000000000000..b57f4a5f5176
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -0,0 +1,3500 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <drm/panthor_drm.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_exec.h>
> +#include <drm/drm_gem_shmem_helper.h>
> +#include <drm/drm_managed.h>
> +#include <drm/gpu_scheduler.h>
> +
> +#include <linux/build_bug.h>
> +#include <linux/clk.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/firmware.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/iosys-map.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/dma-resv.h>
> +
> +#include "panthor_sched.h"
> +#include "panthor_devfreq.h"
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_heap.h"
> +#include "panthor_regs.h"
> +#include "panthor_gpu.h"
> +#include "panthor_fw.h"
> +#include "panthor_mmu.h"
> +
> +/**
> + * DOC: Scheduler
> + *
> + * Mali CSF hardware adopts a firmware-assisted scheduling model, where
> + * the firmware takes care of scheduling aspects, to some extend.

NIT: s/extend/extent/

> + *
> + * The scheduling happens at the scheduling group level, each group
> + * contains 1 to N queues (N is FW/hardware dependent, and exposed
> + * through the firmware interface). Each queue is assigned a command
> + * stream ring buffer, which serves as a way to get jobs submitted to
> + * the GPU, among other things.
> + *
> + * The firmware can schedule a maximum of M groups (M is FW/hardware
> + * dependent, and exposed through the firmware interface). Passed
> + * this maximum number of groups, the kernel must take care of
> + * rotating the groups passed to the firmware so every group gets
> + * a chance to have his queues scheduled for execution.
> + *
> + * The current implementation only supports with kernel-mode queues.
> + * In other terms, userspace doesn't have access to the ring-buffer.
> + * Instead, userspace passes indirect command stream buffers that are
> + * called from the queue ring-buffer by the kernel using a pre-defined
> + * sequence of command stream instructions to ensure the userspace driver
> + * always gets consistent results (cache maintenance,
> + * synchronization, ...).
> + *
> + * We rely on the drm_gpu_scheduler framework to deal with job
> + * dependencies and submission. As any other driver dealing with a
> + * FW-scheduler, we use the 1:1 entity:scheduler mode, such that each
> + * entity has its own job scheduler. When a job is ready to be executed
> + * (all its dependencies are met), it is pushed to the appropriate
> + * queue ring-buffer, and the group is scheduled for execution if it
> + * wasn't already active.
> + *
> + * Kernel-side group scheduling is timeslice-based. When we have less
> + * groups than there are slots, the periodic tick is disabled and we
> + * just let the FW schedule the active groups. When there are more
> + * groups than slots, we let each group a chance to execute stuff for
> + * a given amount of time, and then re-evaluate and pick new groups
> + * to schedule. The group selection algorithm is based on
> + * priority+round-robin.
> + *
> + * Even though user-mode queues is out of the scope right now, the
> + * current design takes them into account by avoiding any guess on the
> + * group/queue state that would be based on information we wouldn't have
> + * if userspace was in charge of the ring-buffer. That's also one of the
> + * reason we don't do 'cooperative' scheduling (encoding FW group slot
> + * reservation as dma_fence that would be returned from the
> + * drm_gpu_scheduler::prepare_job() hook, and treating group rotation as
> + * a queue of waiters, ordered by job submission order). This approach
> + * would work for kernel-mode queues, but would make user-mode queues a
> + * lot more complicated to retrofit.
> + */
> +

<snip>
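
As an aside for anyone reading along: the priority+round-robin selection
described in the DOC comment above can be pictured with the rough sketch
below. The function name, the runnable[] array and the "index 0 is the
highest priority" convention are purely illustrative (they are not the
helpers added by this patch); only the rotation idea matches the text:
groups picked on one tick are re-queued at the tail of their priority
list afterwards, so every group eventually gets a slot.

/*
 * Illustrative sketch only, not the driver's tick logic. Relies on the
 * <linux/list.h> helpers and on panthor_group::run_node declared below.
 */
static u32 pick_next_groups(struct list_head *runnable, u32 prio_count,
			    struct list_head *picked, u32 slot_count)
{
	u32 picked_count = 0, prio;

	/* In this sketch, runnable[0] is assumed to be the highest priority. */
	for (prio = 0; prio < prio_count && picked_count < slot_count; prio++) {
		while (!list_empty(&runnable[prio]) && picked_count < slot_count) {
			struct panthor_group *group =
				list_first_entry(&runnable[prio],
						 struct panthor_group, run_node);

			/*
			 * Moving the group out of the runnable list is what
			 * implements the rotation: when its timeslice ends,
			 * the group goes back to the tail, behind the groups
			 * that have not run yet.
			 */
			list_move_tail(&group->run_node, picked);
			picked_count++;
		}
	}

	return picked_count;
}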
> +
> +/**
> + * struct panthor_group - Scheduling group object
> + */
> +struct panthor_group {
> +	/** @refcount: Reference count */
> +	struct kref refcount;
> +
> +	/** @ptdev: Device. */
> +	struct panthor_device *ptdev;
> +
> +	/** @vm: VM bound to the group. */
> +	struct panthor_vm *vm;
> +
> +	/** @compute_core_mask: Mask of shader cores that can be used for compute jobs. */
> +	u64 compute_core_mask;
> +
> +	/** @fragment_core_mask: Mask of shader cores that can be used for fragment jobs. */
> +	u64 fragment_core_mask;
> +
> +	/** @tiler_core_mask: Mask of tiler cores that can be used for tiler jobs. */
> +	u64 tiler_core_mask;
> +
> +	/** @max_compute_cores: Maximum number of shader cores used for compute jobs. */
> +	u8 max_compute_cores;
> +
> +	/** @max_compute_cores: Maximum number of shader cores used for fragment jobs. */
> +	u8 max_fragment_cores;
> +
> +	/** @max_tiler_cores: Maximum number of tiler cores used for tiler jobs. */
> +	u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (check panthor_csg_priority). */
> +	u8 priority;
> +
> +	/** @blocked_queues: Bitmask reflecting the blocked queues. */
> +	u32 blocked_queues;
> +
> +	/** @idle_queues: Bitmask reflecting the blocked queues. */

s/blocked/idle/

> +	u32 idle_queues;
> +
> +	/** @fatal_lock: Lock used to protect access to fatal fields. */
> +	spinlock_t fatal_lock;
> +
> +	/** @fatal_queues: Bitmask reflecting the queues that hit a fatal exception. */
> +	u32 fatal_queues;
> +
> +	/** @tiler_oom: Mask of queues that have a tiler OOM event to process. */
> +	atomic_t tiler_oom;
> +
> +	/** @queue_count: Number of queues in this group. */
> +	u32 queue_count;
> +
> +	/** @queues: Queues owned by this group. */
> +	struct panthor_queue *queues[MAX_CS_PER_CSG];
> +
> +	/**
> +	 * @csg_id: ID of the FW group slot.
> +	 *
> +	 * -1 when the group is not scheduled/active.
> +	 */
> +	int csg_id;
> +
> +	/**
> +	 * @destroyed: True when the group has been destroyed.
> +	 *
> +	 * If a group is destroyed it becomes useless: no further jobs can be submitted
> +	 * to its queues. We simply wait for all references to be dropped so we can
> +	 * release the group object.
> +	 */
> +	bool destroyed;
> +
> +	/**
> +	 * @timedout: True when a timeout occurred on any of the queues owned by
> +	 * this group.
> +	 *
> +	 * Timeouts can be reported by drm_sched or by the FW. In any case, any
> +	 * timeout situation is unrecoverable, and the group becomes useless.
> +	 * We simply wait for all references to be dropped so we can release the
> +	 * group object.
> +	 */
> +	bool timedout;
> +
> +	/**
> +	 * @syncobjs: Pool of per-queue synchronization objects.
> +	 *
> +	 * One sync object per queue. The position of the sync object is
> +	 * determined by the queue index.
> +	 */
> +	struct panthor_kernel_bo *syncobjs;
> +
> +	/** @state: Group state. */
> +	enum panthor_group_state state;
> +
> +	/**
> +	 * @suspend_buf: Suspend buffer.
> +	 *
> +	 * Stores the state of the group and its queues when a group is suspended.
> +	 * Used at resume time to restore the group in its previous state.
> +	 *
> +	 * The size of the suspend buffer is exposed through the FW interface.
> +	 */
> +	struct panthor_kernel_bo *suspend_buf;
> +
> +	/**
> +	 * @protm_suspend_buf: Protection mode suspend buffer.
> +	 *
> +	 * Stores the state of the group and its queues when a group that's in
> +	 * protection mode is suspended.
> +	 *
> +	 * Used at resume time to restore the group in its previous state.
> +	 *
> +	 * The size of the protection mode suspend buffer is exposed through the
> +	 * FW interface.
> +	 */
> +	struct panthor_kernel_bo *protm_suspend_buf;
> +
> +	/** @sync_upd_work: Work used to check/signal job fences. */
> +	struct work_struct sync_upd_work;
> +
> +	/** @tiler_oom_work: Work used to process tiler OOM events happening on this group. */
> +	struct work_struct tiler_oom_work;
> +
> +	/** @term_work: Work used to finish the group termination procedure. */
> +	struct work_struct term_work;
> +
> +	/**
> +	 * @release_work: Work used to release group resources.
> +	 *
> +	 * We need to postpone the group release to avoid a deadlock when
> +	 * the last ref is released in the tick work.
> +	 */
> +	struct work_struct release_work;
> +
> +	/**
> +	 * @run_node: Node used to insert the group in the
> +	 * panthor_group::groups::{runnable,idle} and
> +	 * panthor_group::reset.stopped_groups lists.
> +	 */
> +	struct list_head run_node;
> +
> +	/**
> +	 * @wait_node: Node used to insert the group in the
> +	 * panthor_group::groups::waiting list.
> +	 */
> +	struct list_head wait_node;
> +};
> +
> +/**
> + * group_queue_work() - Queue a group work
> + * @group: Group to queue the work for.
> + * @wname: Work name.
> + *
> + * Grabs a ref and queue a work item to the scheduler workqueue. If
> + * the work was already queued, we release the reference we grabbed.
> + *
> + * Work callbacks must release the reference we grabbed here.
> + */
> +#define group_queue_work(group, wname) \
> +	do { \
> +		group_get(group); \
> +		if (!queue_work((group)->ptdev->scheduler->wq, &(group)->wname ## _work)) \
> +			group_put(group); \
> +	} while (0)
> +
> +/**
> + * sched_queue_work() - Queue a scheduler work.
> + * @sched: Scheduler object.
> + * @wname: Work name.
> + *
> + * Conditionally queues a scheduler work if no reset is pending/in-progress.
> + */
> +#define sched_queue_work(sched, wname) \
> +	do { \
> +		if (!atomic_read(&sched->reset.in_progress) && \
> +		    !panthor_device_reset_is_pending((sched)->ptdev)) \
> +			queue_work((sched)->wq, &(sched)->wname ## _work); \
> +	} while (0)
> +
> +/**
> + * sched_queue_delayed_work() - Queue a scheduler delayed work.
> + * @sched: Scheduler object.
> + * @wname: Work name.
> + * @delay: Work delay in jiffies.
> + *
> + * Conditionally queues a scheduler delayed work if no reset is
> + * pending/in-progress.
> + */
> +#define sched_queue_delayed_work(sched, wname, delay) \
> +	do { \
> +		if (!atomic_read(&sched->reset.in_progress) && \
> +		    !panthor_device_reset_is_pending((sched)->ptdev)) \
> +			mod_delayed_work((sched)->wq, &(sched)->wname ## _work, delay); \
> +	} while (0)
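
To make the group_queue_work() macro above concrete, a call site could
look like the hypothetical helper below (term_work is the field declared
in struct panthor_group above; the wrapper itself is invented for
illustration):

/* Hypothetical call site, for illustration only. */
static void group_schedule_termination(struct panthor_group *group)
{
	/*
	 * Expands to roughly:
	 *
	 *   group_get(group);
	 *   if (!queue_work(group->ptdev->scheduler->wq, &group->term_work))
	 *           group_put(group);
	 *
	 * i.e. the reference taken here is either handed over to the work
	 * item, or dropped immediately when the work was already queued.
	 */
	group_queue_work(group, term);
}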
> +
> +/*
> + * We currently set the maximum of groups per file to an arbitrary low value.
> + * But this can be updated if we need more.
> + */
> +#define MAX_GROUPS_PER_POOL 128
> +
> +/**
> + * struct panthor_group_pool - Group pool
> + *
> + * Each file get assigned a group pool.
> + */
> +struct panthor_group_pool {
> +	/** @xa: Xarray used to manage group handles. */
> +	struct xarray xa;
> +};
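
For illustration, handing out a handle from such a pool with the xarray
API could look roughly like the helper below. The function name and the
exact limit/flag choices are assumptions made for the example, not
necessarily what the driver ends up doing:

/*
 * Illustrative only: allocate a handle in [1, MAX_GROUPS_PER_POOL] for a
 * new group. Assumes pool->xa was initialised with
 * xa_init_flags(&pool->xa, XA_FLAGS_ALLOC1).
 */
static int group_pool_add(struct panthor_group_pool *pool,
			  struct panthor_group *group, u32 *handle)
{
	return xa_alloc(&pool->xa, handle, group,
			XA_LIMIT(1, MAX_GROUPS_PER_POOL), GFP_KERNEL);
}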
> +
> +/**
> + * struct panthor_job - Used to manage GPU job
> + */
> +struct panthor_job {
> +	/** @base: Inherit from drm_sched_job. */
> +	struct drm_sched_job base;
> +
> +	/** @refcount: Reference count. */
> +	struct kref refcount;
> +
> +	/** @group: Group of the queue this job will be pushed to. */
> +	struct panthor_group *group;
> +
> +	/** @queue_idx: Index of the queue inside @group. */
> +	u32 queue_idx;
> +
> +	/** @call_info: Information about the userspace command stream call. */
> +	struct {
> +		/** @start: GPU address of the userspace command stream. */
> +		u64 start;
> +
> +		/** @size: Size of the userspace command stream. */
> +		u32 size;
> +
> +		/**
> +		 * @latest_flush: Flush ID at the time the userspace command
> +		 * stream was built.
> +		 *
> +		 * Needed for the flush reduction mechanism.
> +		 */
> +		u32 latest_flush;
> +	} call_info;
> +
> +	/** @ringbuf: Position of this job is in the ring buffer. */
> +	struct {
> +		/** @start: Start offset. */
> +		u64 start;
> +
> +		/** @end: End offset. */
> +		u64 end;
> +	} ringbuf;
> +
> +	/**
> +	 * @node: Used to insert the job in the panthor_queue::fence_ctx::in_flight_jobs
> +	 * list.
> +	 */
> +	struct list_head node;
> +
> +	/** @done_fence: Fence signaled when the job is finished or cancelled. */
> +	struct dma_fence *done_fence;
> +};
> +
> +static void
> +panthor_queue_put_syncwait_obj(struct panthor_queue *queue)
> +{
> +	if (queue->syncwait.kmap) {
> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(queue->syncwait.kmap);
> +
> +		drm_gem_vunmap_unlocked(queue->syncwait.obj, &map);
> +		queue->syncwait.kmap = NULL;
> +	}
> +
> +	drm_gem_object_put(queue->syncwait.obj);
> +	queue->syncwait.obj = NULL;
> +}
> +
> +static void *
> +panthor_queue_get_syncwait_obj(struct panthor_group *group, struct panthor_queue *queue)
> +{
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_gem_object *bo;
> +	struct iosys_map map;
> +	int ret;
> +
> +	if (queue->syncwait.kmap)
> +		return queue->syncwait.kmap + queue->syncwait.offset;
> +
> +	bo = panthor_vm_get_bo_for_va(group->vm,
> +				      queue->syncwait.gpu_va,
> +				      &queue->syncwait.offset);
> +	if (drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(bo)))
> +		goto err_put_syncwait_obj;
> +
> +	queue->syncwait.obj = &bo->base.base;
> +	ret = drm_gem_vmap_unlocked(queue->syncwait.obj, &map);
> +	if (drm_WARN_ON(&ptdev->base, ret))
> +		goto err_put_syncwait_obj;
> +
> +	queue->syncwait.kmap = map.vaddr;
> +	if (drm_WARN_ON(&ptdev->base, !queue->syncwait.kmap))
> +		goto err_put_syncwait_obj;
> +
> +	return queue->syncwait.kmap + queue->syncwait.offset;
> +
> +err_put_syncwait_obj:
> +	panthor_queue_put_syncwait_obj(queue);
> +	return NULL;
> +}
> +
> +static void group_free_queue(struct panthor_group *group, struct panthor_queue *queue)
> +{
> +	if (IS_ERR_OR_NULL(queue))
> +		return;
> +
> +	if (queue->entity.fence_context)
> +		drm_sched_entity_destroy(&queue->entity);
> +
> +	if (queue->scheduler.ops)
> +		drm_sched_fini(&queue->scheduler);
> +
> +	panthor_queue_put_syncwait_obj(queue);
> +
> +	panthor_kernel_bo_destroy(group->vm, queue->ringbuf);
> +	panthor_kernel_bo_destroy(panthor_fw_vm(group->ptdev), queue->iface.mem);
> +
> +	kfree(queue);
> +}
> +
> +static void group_release_work(struct work_struct *work)
> +{
> +	struct panthor_group *group = container_of(work,
> +						   struct panthor_group,
> +						   release_work);
> +	struct panthor_device *ptdev = group->ptdev;
> +	u32 i;
> +
> +	for (i = 0; i < group->queue_count; i++)
> +		group_free_queue(group, group->queues[i]);
> +
> +	panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->suspend_buf);
> +	panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), group->protm_suspend_buf);
> +
> +	if (!IS_ERR_OR_NULL(group->syncobjs))

No need for this check - panthor_kernel_bo_destroy() does it.

> +		panthor_kernel_bo_destroy(group->vm, group->syncobjs);
> +
> +	panthor_vm_put(group->vm);
> +	kfree(group);
> +}

Thanks,
Steve