Re: [PATCH v4 1/4] drm/panthor: introduce job cycle and timestamp accounting

Hi Adrián,

On 31/07/2024 13:41, Adrián Larumbe wrote:
> Hi Steven, thanks for the remarks.
> 
> On 19.07.2024 15:14, Steven Price wrote:
>> On 16/07/2024 21:11, Adrián Larumbe wrote:
>>> Enable calculations of job submission times in clock cycles and wall
>>> time. This is done by expanding the boilerplate command stream when running
>>> a job to include instructions that compute said times right before and
>>> after a user CS.
>>>
>>> Those numbers are stored in the queue's group's sync objects BO, right
>>> after the sync objects themselves. Because the queues in a group might have
>>> different numbers of slots, one must keep track of the overall slot tally
>>> when reckoning the offset of a queue's time sample structs, one per slot.
>>>
>>> This commit is done in preparation for enabling DRM fdinfo support in the
>>> Panthor driver, which depends on the numbers calculated herein.
>>>
>>> A profile mode device flag has been added that will in a future commit
>>> allow UM to toggle time sampling behaviour, which is disabled by default to
>>> save power. It also enables marking jobs as being profiled and picks one of
>>> two call instruction arrays to insert into the ring buffer. One of them
>>> includes FW logic to sample the timestamp and cycle counter registers and
>>> write them into the job's syncobj, and the other does not.
>>>
>>> A profiled job's call sequence takes up two ring buffer slots, and this is
>>> reflected when initialising the DRM scheduler for each queue, with a
>>> profiled job contributing twice as many credits.
>>>
>>> Signed-off-by: Adrián Larumbe <adrian.larumbe@xxxxxxxxxxxxx>
>>
>> Thanks for the updates, this looks better. A few minor comments below.
>>
>>> ---
>>>  drivers/gpu/drm/panthor/panthor_device.h |   2 +
>>>  drivers/gpu/drm/panthor/panthor_sched.c  | 244 ++++++++++++++++++++---
>>>  2 files changed, 216 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
>>> index e388c0472ba7..3ede2f80df73 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_device.h
>>> +++ b/drivers/gpu/drm/panthor/panthor_device.h
>>> @@ -162,6 +162,8 @@ struct panthor_device {
>>>  		 */
>>>  		struct page *dummy_latest_flush;
>>>  	} pm;
>>> +
>>> +	bool profile_mode;
>>>  };
>>>  
>>>  /**
>>> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
>>> index 79ffcbc41d78..6438e5ea1f2b 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_sched.c
>>> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
>>> @@ -93,6 +93,9 @@
>>>  #define MIN_CSGS				3
>>>  #define MAX_CSG_PRIO				0xf
>>>  
>>> +#define NUM_INSTRS_PER_SLOT			16
>>> +#define SLOTSIZE				(NUM_INSTRS_PER_SLOT * sizeof(u64))
>>> +
>>>  struct panthor_group;
>>>  
>>>  /**
>>> @@ -466,6 +469,9 @@ struct panthor_queue {
>>>  		 */
>>>  		struct list_head in_flight_jobs;
>>>  	} fence_ctx;
>>> +
>>> +	/** @time_offset: Offset of panthor_job_times structs in group's syncobj bo. */
>>> +	unsigned long time_offset;
>>
>> AFAICT this doesn't need to be stored. We could just pass this value
>> into group_create_queue() as an extra parameter where it's used.
> 
> I think we need to keep this offset value around, because queues within the same group
> can have different numbers of slots. Without it, when fetching the sampled values from
> the syncobjs BO in update_fdinfo_stats, we would have to traverse all of the preceding
> queues and work out their size in slots, so as to skip over that many
> struct panthor_job_times entries after the preceding syncobj array.

Yes, I think I was getting myself confused - for some reason I'd thought
the ring buffers in each queue were all the same size. It makes sense to
keep this.
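
Just to check I've now got the layout straight, I'm picturing the offset
calculation as roughly the below. The helper and the "ringbuf_slots" field
are made up purely for illustration; only the syncobj struct and
panthor_job_times come from the driver/patch:

/* Illustrative only - not the actual patch code. */
static u32 queue_times_offset(struct panthor_group *group, u32 queue_idx)
{
	/* The per-slot time samples live right after the per-queue syncobjs. */
	u32 offset = group->queue_count * sizeof(struct panthor_syncobj_64b);
	u32 i;

	/* Skip the sample structs of all preceding queues in the group. */
	for (i = 0; i < queue_idx; i++)
		offset += group->queues[i]->ringbuf_slots *
			  sizeof(struct panthor_job_times);

	return offset;
}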

<snip>

>>> @@ -3384,9 +3565,12 @@ panthor_job_create(struct panthor_file *pfile,
>>>  		goto err_put_job;
>>>  	}
>>>  
>>> +	job->is_profiled = pfile->ptdev->profile_mode;
>>> +
>>>  	ret = drm_sched_job_init(&job->base,
>>>  				 &job->group->queues[job->queue_idx]->entity,
>>> -				 1, job->group);
>>> +				 job->is_profiled ? NUM_INSTRS_PER_SLOT * 2 :
>>> +				 NUM_INSTRS_PER_SLOT, job->group);
>>
>> Is there a good reason to make the credits the number of instructions,
>> rather than the number of slots? If we were going to count the actual
>> number of non-NOP instructions then there would be some logic (although
>> I'm not convinced that makes much sense), but considering we only allow
>> submission in "slot granules" we might as well use that as the unit of
>> "credit".
> 
> In my initial pre-ML version of the patch series I was passing the number of
> queue slots as the total credit count, but Boris was keener on setting it to
> the total number of instructions instead.
> 
> I agree with you that both are equivalent, one just being an integer multiple
> of the other, so I'm fine with either choice. Maybe Boris can pitch in, in
> case he has a strong opinion about this.

I wouldn't say I have a strong opinion; it just seems a little odd to be
multiplying the value by a constant everywhere. If you've got some
changes planned where the instruction count could vary more dynamically,
that would be good to know about.
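
For reference, what I had in mind was simply counting in slot units, so a
queue's credit limit would be its number of slots and the job init would
look something like the below (untested, just to illustrate the
alternative):

	/* One credit per ring buffer slot; a profiled job occupies two slots. */
	ret = drm_sched_job_init(&job->base,
				 &job->group->queues[job->queue_idx]->entity,
				 job->is_profiled ? 2 : 1, job->group);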

If Boris is "keener" on this approach then that's fine, we'll leave it
this way.

Steve


