29.06.2020 13:27, Mikko Perttunen wrote:
...
>>>> 4. The job's sync point can't be re-used after job's submission (UAPI
>>>> constraint!). Userspace must free sync point and allocate a new one for
>>>> the next job submission. And now we:
>>>>
>>>>   - Know that job's sync point is always in a healthy state!
>>>>
>>>>   - We're not limited by a number of physically available hardware sync
>>>>     points! Allocation should block until free sync point is available.
>>>>
>>>>   - The logical number of job's sync point increments matches the SP
>>>>     hardware state! Which is handy for a job's debugging.
>>>>
>>>> Optionally, the job's sync point could be auto-removed from the DRM's
>>>> context after job's submission, avoiding a need for an extra SYNCPT_PUT
>>>> IOCTL invocation to be done by userspace after the job's submission.
>>>> Could be a job's flag.
>>>
>>> I think this would cause problems where after a job completes but before
>>> the fence has been waited, the syncpoint is already recycled (especially
>>> if the syncpoint is reset into some clean state).
>>
>> Exactly, good point! The dma-fence shouldn't be hardwired to the sync
>> point in order to avoid this situation :)
>>
>> Please take a look at the fence implementation that I made for the
>> grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
>> to a sync point by host1x_fence_create(). Once job is completed, the
>> host1x-fence is detached from the sync point [5][6] and sync point could
>> be recycled safely!
>
> What if the fence has been programmed as a prefence to another channel
> (that is getting delayed), or to the GPU, or some other accelerator like
> DLA, or maybe some other VM? Those don't know the dma_fence has been
> signaled, they can only rely on the syncpoint ID/threshold pair.

The explicit job fence is always just a dma-fence; it's not tied to a
host1x-fence and it should be waited on (for a signal) on the CPU. If you
want a job to wait for a sync point on hardware, then you should use the
drm_tegra_submit_command wait-command.

Again, please note that the DRM scheduler supports the job-submitted fence!
This dma-fence signals once the job is pushed to hardware for execution, so
it shouldn't be a problem to maintain job ordering for complex jobs without
much hassle. We'll need to write some userspace to check how it works in
practice :) For now I only have experience with simple jobs.

Secondly, I suppose neither the GPU nor the DLA could wait on a host1x sync
point, correct? Or are they integrated with the Host1x HW? Anyway, it
shouldn't be difficult to resolve a dma-fence into a host1x-fence, get the
SP ID and maintain the SP's liveness. Please see more below.

In the grate-driver I made all sync points refcounted, so a sync point
won't be recycled while it has active users [1][2][3] in the kernel (or
userspace).

[1] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L428
[2] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L1206
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L163

Now, grate-kernel isn't a 100% complete implementation, as I already
mentioned before. The host1x-fence doesn't hold a reference to a sync
point, as you may see in the code, because userspace sync points are not
implemented in the grate-driver. But nothing stops us from adding that SP
reference, and then we could simply do the following in the code:

struct dma_fence *host1x_fence_create(syncpt, ...)
{
	...
	fence->sp = syncpt;
	...
	return &fence->base;
}

/* detaches the fence from its sync point once the threshold is reached */
void host1x_syncpt_signal_fence(struct host1x_fence *fence)
{
	...
	fence->sp = NULL;
}

irqreturn_t host1x_hw_syncpt_isr()
{
	spin_lock(&host1x_syncpts_lock);
	...
	host1x_syncpt_signal_fence(sp->fence);
	...
	spin_unlock(&host1x_syncpts_lock);
}

void host1x_submit_job(job)
{
	unsigned long flags;
	...
	/* take a reference on the SP while the fence is still attached */
	spin_lock_irqsave(&host1x_syncpts_lock, flags);
	sp = host1x_syncpt_get(host1x_fence->sp);
	spin_unlock_irqrestore(&host1x_syncpts_lock, flags);
	...
	if (sp) {
		push(WAIT(sp->id, host1x_fence->threshold));
		job->sync_points = sp;
	}
}

void host1x_free_job(job)
{
	host1x_syncpt_put(job->sync_points);
	...
}
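
To make the "resolve a dma-fence into a host1x-fence" step a bit more
concrete, here is a minimal sketch (my assumption, not actual grate-driver
code) of how host1x_submit_job() above could obtain host1x_fence from a
generic dma_fence pre-fence. The host1x_fence_ops, the struct layout and
the to_host1x_fence() helper are hypothetical names:

static const struct dma_fence_ops host1x_fence_ops;

struct host1x_fence {
	struct dma_fence base;
	struct host1x_syncpt *sp;	/* NULL once the fence got detached */
	u32 threshold;
};

static struct host1x_fence *to_host1x_fence(struct dma_fence *fence)
{
	/* only fences created by host1x_fence_create() can be resolved */
	if (fence->ops != &host1x_fence_ops)
		return NULL;

	return container_of(fence, struct host1x_fence, base);
}

If to_host1x_fence() returns NULL (a foreign fence, e.g. one coming from
the GPU driver), the submission path would have to fall back to a CPU wait
with dma_fence_wait() before pushing the job to the channel.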
For example: if you share the host1x-fence (dma-fence) with a GPU context,
then the fence's SP won't be released until the GPU's context is done with
the SP. The GPU's job will time out if the shared SP doesn't get
incremented, and that is a totally okay situation.

Does this answer your question?

===

I'm not familiar with the Host1x VMs, so please let me use my imagination
here:

In the case of a VM we could have a special VM-shared sync point type.
Userspace would allocate this special VM SP using ALLOCATE_SYNCPOINT, and
this SP wouldn't be used for the job itself(!). This is the case where a
job needs to increment multiple sync points: its own SP plus the VM's SP
(see the sketch at the end of this mail).

If a job hangs, then there should be a way to tell the VM to release the SP
and try again next time with a freshly allocated SP. The shared SP should
stay alive as long as the VM uses it, so there should be a way for the VM
to tell that it's done with the SP.

Alternatively, we could add SP recovery (or whatever is needed) for the VM,
but this should be kept specific to T194+. Older Tegras shouldn't ever need
this complexity, if I'm not missing anything.

Please provide detailed information about the VM's workflow if the above
doesn't sound good. Perhaps we shouldn't focus on VM support for now, but
we may leave some room for a potential future expansion if necessary.
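
Purely illustrative sketch of the multi-sync-point idea, in the same
pseudocode style as host1x_submit_job() above. The job->syncpt and
job->vm_syncpt fields are placeholders for the job-local and VM-shared sync
points, and INCR() just stands for whatever opcodes increment that sync
point from the channel:

/* hypothetical: push increments for both the job-local and VM-shared SP */
void host1x_push_job_increments(struct host1x_job *job)
{
	/* job-local SP: signals the job's own host1x-fence on completion */
	push(INCR(job->syncpt->id));

	/* optional VM-shared SP: the other VM only sees the ID/threshold pair */
	if (job->vm_syncpt)
		push(INCR(job->vm_syncpt->id));
}

The point is only that the VM-shared SP would be incremented in addition
to, not instead of, the job-local one, so recycling the job's own SP after
completion stays safe.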