On 29/08/2023 18:46, Boris Brezillon wrote:
> On Mon, 21 Aug 2023 12:31:29 +0100
> Steven Price <steven.price@xxxxxxx> wrote:
>
>> On 09/08/2023 17:53, Boris Brezillon wrote:
[...]
>>> + * // Collect signal operations on all jobs, such that each job can pick
>>> + * // from it for its dependencies and update the fence to signal when
>>> + * // the job is submitted.
>>
>> I can't figure out here how we avoid dependency loops within a batch.
>> What stops two jobs from each depending on each other?
>>
>> Or do we "allow" this but rely on the loop in
>> panthor_submit_ctx_add_deps_and_arm_jobs() to effectively enforce that
>> a job cannot actually depend on a job which is later in the batch.
>
> You can't have circular dependencies because the job fence is created
> after its dependencies have been registered, so a job at the beginning
> of the array can't depend on a job that's coming after. It might be
> passed the same syncobj, but since a syncobj is just a container, the
> fence attached to the syncobj at the time the first job adds it as a
> dependency will point to a different dma_fence.
>
>> In which case why bother with this complexity rather than just
>> performing all the steps on each job in order?
>
> Because, before submitting a set of jobs, we want to make sure all jobs
> that are passed to a submit request are valid and enough resources are
> available for their execution to proceed. We could allow partial
> execution (and that's actually the approach I had taken in one of the
> patches I proposed to allow submitting multiple jobs in one call to
> panfrost), but then you potentially have to figure out where things
> failed, not to mention that the syncobjs might point to intermediate
> dma_fence objects instead of the final one.
>
>> Being able to submit a forward dependency, but then having it ignored
>> seems like an odd design. So I feel like I must be missing something.
>
> It's not about allowing forward dependencies (that would be a mess),
> but allowing one job to take a dependency on a job that appears
> earlier in the job array of the same submit call.
>
>>> + *	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
>
> Here panthor_submit_ctx_collect_jobs_signal_ops() is not registering
> job out_fences to the syncobjs, it's just collecting all signal
> operations from all jobs in an array. Each entry in this array contains
> the syncobj handle, the syncobj object, and the fence that was attached
> to it at the time the collection happens, and that's it.
>
> Now, when a job is populated, and after we've made sure it has
> everything it needs to be submitted, for each signal operation passed
> to this specific job, we update the corresponding entry in the signal
> array with the job finished fence, but the syncobj is not updated at
> that point, because we want to make sure all jobs belonging to a submit
> can be submitted before exposing their fences to the outside world.
>
> For jobs appearing later in the array, when we see a WAIT operation,
> we first check the signal array to see if there's a corresponding
> entry cached there for the given syncobj handle. If there is, we take
> the dma_fence from there (this dma_fence might come from a job
> submitted earlier in this submit context, or it might be the fence
> that was there initially); if not, we call drm_syncobj_find_fence() to
> get the dependency.
>
> Once all jobs have been parsed/checked/populated, we start the
> non-failing step => job submission. And after that point, we can start
> exposing the job fences to the outside world. This is what happens in
> panthor_submit_ctx_push_fences(): we iterate over the signal
> operations, and update each syncobj with the fence that was last
> attached to it (the last job in the submit array having a SIGNAL
> operation on that syncobj).

Thanks for the detailed explanation. I guess I hadn't considered the
benefits of checking everything is valid and obtaining resources before
submitting anything. That makes sense and I guess justifies this
complexity.
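Just to check I've followed the staging, the flow is roughly the below.
This is only my own paraphrase to confirm my understanding - the struct
layout and the find_signal_entry() helper are made up rather than taken
from the patch, so apologies if I've mangled any details:

/* One entry per syncobj handle that any job in this submit signals. */
struct signal_entry {
	u32 handle;
	struct drm_syncobj *syncobj;
	/* Fence attached to the syncobj at collection time; replaced by
	 * the finished fence of the last populated job that signals this
	 * syncobj. */
	struct dma_fence *fence;
};

/* Resolving a WAIT op: prefer the cached entry so a job can depend on a
 * job earlier in the same submit, otherwise fall back to whatever the
 * syncobj currently holds. */
static int resolve_wait_op(struct panthor_submit_ctx *ctx, u32 handle,
			   u64 point, struct dma_fence **fence)
{
	struct signal_entry *entry = find_signal_entry(ctx, handle);

	if (entry) {
		*fence = dma_fence_get(entry->fence);
		return 0;
	}

	return drm_syncobj_find_fence(ctx->file, handle, point, 0, fence);
}

/* The syncobjs themselves are only updated with the final fences once
 * every job has been queued, in panthor_submit_ctx_push_fences(). */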
[...]

>>> +static int
>>> +panthor_submit_ctx_add_job(struct panthor_submit_ctx *ctx, u32 idx,
>>> +			   struct drm_sched_job *job,
>>> +			   const struct drm_panthor_obj_array *syncs)
>>> +{
>>> +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
>>> +						    struct panthor_device,
>>> +						    base);
>>> +	int ret;
>>> +
>>> +	if (drm_WARN_ON(&ptdev->base,
>>> +			idx >= ctx->job_count ||
>>> +			ctx->jobs[idx].job ||
>>> +			ctx->jobs[idx].syncops ||
>>> +			ctx->jobs[idx].syncop_count))
>>> +		return -EINVAL;
>>> +
>>> +	ctx->jobs[idx].job = job;
>>
>> While the WARN_ON obviously shouldn't happen, this positioning of the
>> ctx->jobs[].job assignment means the caller has no idea if the
>> assignment has happened. AFAICT in the case of the WARN_ON the job
>> isn't cleaned up properly.
>
> It's not really about cleanup not happening, more about being passed an
> index that was already populated.
>
>> The options I can see are to move this line further down (and make the
>> caller clean up that one job if this function fails), or to clean up
>> the job in the case where the WARN_ON fails.
>
> Maybe I should drop this WARN_ON() and assume the caller passed a valid
> index...

I'd be fine with that. My reordering suggestion is a bit pointless I
must admit ;)

[...]

>>> +
>>> +	for (u32 i = 0; i < sync_op_count; i++) {
>>> +		struct panthor_sync_signal *sig_sync;
>>> +		struct dma_fence *fence;
>>> +
>>> +		if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
>>> +			continue;
>>
>> NIT: It might be worth having a helper for the operation type. It's a
>> little confusing that we have !(flags & SIGNAL) and (flags & SIGNAL)
>> but not (flags & WAIT) - obviously looking at the definition shows
>> why. Also there'll be a lot of careful refactoring needed if a third
>> operation is ever added.
>
> I had the operation as a separate field initially, but I couldn't think
> of any other operations we could do on a syncobj, so I decided to make
> it a flag, and mimic what Xe does.

A flag is fine, I just find it harder to read:

if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
[...]
if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))

vs

bool is_signal_op(struct drm_panthor_sync_op *op)
{
	return !!(op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
}

bool is_wait_op(struct drm_panthor_sync_op *op)
{
	return !(op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
}

if (is_signal_op(&sync_ops[i]))
[...]
if (is_wait_op(&sync_ops[i]))

And it avoids anyone accidentally writing:

if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_WAIT)

which in my quick test the compiler doesn't even warn on :(

Although on the subject of the flag, apparently the enumerator value
doesn't compile with -pedantic as it overflows into the sign bit:

include/drm/panthor_drm.h:237:31: warning: enumerator value for ‘DRM_PANTHOR_SYNC_OP_SIGNAL’ is not an integer constant expression [-Wpedantic]
  237 |         DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,

Changing it to "(int)(1u << 31)" seems to be a workaround. This affects
DRM_PANTHOR_VM_BIND_OP_TYPE_MASK too.
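For the SIGNAL one, something along these lines keeps -pedantic happy
in my quick test (I'm guessing at the enum name and omitting the other
members, this is only to illustrate the cast):

enum drm_panthor_sync_op_flags {
	/* ... other flag values unchanged ... */

	/* 1 << 31 overflows the (signed int) enumerator, which is what
	 * -Wpedantic objects to; doing the shift on an unsigned value and
	 * casting back keeps the same bit pattern. */
	DRM_PANTHOR_SYNC_OP_SIGNAL = (int)(1u << 31),
};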
[...]

>>> +#define PANTHOR_VM_CREATE_FLAGS		0
>>> +
>>> +static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
>>> +				   struct drm_file *file)
>>> +{
>>> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
>>> +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
>>> +	struct panthor_file *pfile = file->driver_priv;
>>> +	struct drm_panthor_vm_create *args = data;
>>> +	u64 kernel_va_start = 0;
>>> +	int cookie, ret;
>>> +
>>> +	if (!drm_dev_enter(ddev, &cookie))
>>> +		return -ENODEV;
>>> +
>>> +	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
>>> +		ret = -EINVAL;
>>> +		goto out_dev_exit;
>>> +	}
>>> +
>>> +	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {
>>
>> The check for !va_bits would be better done at probe time. I'd also be
>> tempted to move the check for kernel_va_range down to
>> panthor_vm_create() as that has to repeat the va_bits calculation.
>>
>>> +		ret = -EINVAL;
>>> +		goto out_dev_exit;
>>> +	}
>>> +
>>> +	if (args->kernel_va_range)
>>> +		kernel_va_start = (1 << (va_bits - 1)) - args->kernel_va_range;
>>
>> And also push the calculation of va_start down to panthor_vm_create()
>> as well.
>
> panthor_vm_create() is used internally, for the MCU VM creation, and
> I'd prefer to keep it uAPI agnostic. I don't mind moving it to
> panthor_vm_pool_create_vm() but we'd still have to do the va_bits
> calculation twice.

Ah true, for panthor_vm_create() you need to be able to pass in the VA
range for the MCU. We do have the "for_mcu" flag so the
CSF_MCU_SHARED_REGION_START/SIZE #defines could be used directly in
panthor_vm_create(). But I'd be happy with it in
panthor_vm_pool_create_vm() if you'd prefer.

Steve
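PS. For the kernel_va_start part, this is roughly the shape I was
picturing - a small helper that panthor_vm_pool_create_vm() could call.
Completely untested, and the helper name is my own invention:

static u64 panthor_kernel_va_start(struct panthor_device *ptdev,
				   u64 kernel_va_range)
{
	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);

	/* Same calculation as in panthor_ioctl_vm_create(): the kernel VA
	 * region ends at 1ull << (va_bits - 1), so it starts
	 * kernel_va_range below that. */
	return kernel_va_range ? (1ull << (va_bits - 1)) - kernel_va_range : 0;
}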