Quoting Chris Wilson (2019-07-27 01:27:02)
> Quoting Bloomfield, Jon (2019-07-26 23:19:38)
> > Hmmn. We're still on orthogonal perspectives as far as our previous arguments stand. But it doesn't matter because while thinking through your replies, I realized there is one argument in favour, which trumps all my previous arguments against this patch - it makes things deterministic. Without this patch (or hangcheck), whether a context gets nuked depends on what else is running. And that's a recipe for confused support emails.
> >
> > So I retract my other arguments, thanks for staying with me :-)
>
> No worries, it's been really useful, especially realising a few more
> areas we can improve our resilience. You will get your way eventually.
> (But what did it cost? Everything.)

Ok, so just confirming here.

The plan is still to have userspace set a per-context (or per-request)
time limit for the expected completion of a request. This will be useful
for media workloads, which consume a deterministic amount of time for a
correct bitstream. Userspace wants to be notified much sooner than the
generic hangcheck time if the operation failed due to a corrupt
bitstream.

Compute workloads can set this time limit to infinite.

Then, in parallel to that, we have cgroups or a system-wide
configuration for the maximum allowed timeslice per process/context.
That means a long-running workload must pre-empt at that granularity.

That pre-emption/heartbeat should happen regardless of whether other
contexts are requesting the hardware, because it is better to start
recovery of a hung task as soon as it misbehaves.

Regards, Joonas
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
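
To make the plan above concrete, here is a minimal sketch in Python. It is not i915 code: `Context`, `verdict`, the `yields_on_preempt` flag and all millisecond values are hypothetical names and numbers chosen for illustration. It only shows the shape of the decision, assuming a per-context request limit and a system-wide maximum timeslice, with no input about what other contexts are doing.

```python
import math
from dataclasses import dataclass


@dataclass
class Context:
    name: str
    # Userspace-chosen limit for one request; math.inf models "infinite".
    request_limit_ms: float
    # Hypothetical flag: does the workload yield when asked to pre-empt?
    yields_on_preempt: bool = True


def verdict(ctx: Context, runtime_ms: float, max_timeslice_ms: float) -> str:
    """Classify a running request from its own state only.

    Nothing about other contexts is passed in, so the outcome cannot
    depend on what else is running.
    """
    if runtime_ms > ctx.request_limit_ms:
        return "expired"  # userspace limit exceeded; report the failure
    if runtime_ms > max_timeslice_ms and not ctx.yields_on_preempt:
        return "reset"    # did not pre-empt at the timeslice granularity
    return "ok"


media = Context("media", request_limit_ms=50)
compute = Context("compute", request_limit_ms=math.inf)
stuck = Context("stuck", request_limit_ms=math.inf, yields_on_preempt=False)

print(verdict(media, 60, max_timeslice_ms=100))        # expired
print(verdict(compute, 10_000, max_timeslice_ms=100))  # ok
print(verdict(stuck, 150, max_timeslice_ms=100))       # reset
```

The point of the sketch is the function signature: a well-behaved compute job with an infinite limit keeps running, while a job that cannot be pre-empted is flagged once it passes the timeslice, whether or not anything else wants the hardware.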