13.09.2020 12:51, Mikko Perttunen wrote:
...
>> All waits that are internal to a job should only wait for relative sync
>> point increments.
>
>> In the grate-kernel every job uses unique-and-clean sync point (which is
>> also internal to the kernel driver) and a relative wait [1] is used for
>> the job's internal sync point increments [2][3][4], and thus, kernel
>> driver simply jumps over a hung job by updating DMAGET to point at the
>> start of a next job.
>
> Issues I have with this approach:
>
> * Both this and my approach have the requirement for userspace, that if
> a job hangs, the userspace must ensure all external waiters have timed
> out / been stopped before the syncpoint can be freed, as if the
> syncpoint gets reused before then, false waiter completions can happen.
>
> So freeing the syncpoint must be exposed to userspace. The kernel cannot
> do this since there may be waiters that the kernel is not aware of. My
> proposal only has one syncpoint, which I feel makes this part simpler, too.
>
> * I believe this proposal requires allocating a syncpoint for each
> externally visible syncpoint increment that the job does. This can use
> up quite a few syncpoints, and it makes syncpoints a dynamically
> allocated resource with unbounded allocation latency. This is a problem
> for safety-related systems.

Maybe we could have a special type of a "shared" sync point that is
allocated per hardware engine? Then a shared SP won't be a scarce
resource and jobs won't depend on it. The kernel or the userspace driver
could take care of recovering the counter value of the shared SP when a
job hangs, or do whatever else is needed, without affecting the job's
own sync point (there is a very rough sketch of what I mean at the
bottom of this mail).

Primarily, I'm not feeling very happy about retaining the job's sync
point recovery code because it was broken the last time I touched it,
and grate-kernel works fine without it.

> * If a job fails on a "virtual channel" (userctx), I think it's a
> reasonable expectation that further jobs on that "virtual channel" will
> not execute, and I think implementing that model is simpler than doing
> recovery.

Couldn't jobs just use explicit fencing? Then a second job won't be
executed if the first job hangs and an explicit dependency is expressed.
I'm also not sure that the concept of a "virtual channel" maps onto
drm-scheduler.

I'll need to see a full-featured driver implementation and test cases
that cover all the problems you're worried about, because I'm not aware
of all the T124+ needs and seeing the code should help. Maybe in the end
your approach will turn out to be the best, but for now it's not
clear :)
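
To make the shared sync point idea a bit more concrete, here is a very
rough, completely untested sketch. All of the names below are made up
for illustration only and don't match the real host1x code (apart from
host1x_syncpt_read()/host1x_syncpt_incr(), which I use just to show the
intent); counter wraparound and the locking details are glossed over:

/*
 * Hypothetical per-engine shared sync point. The sync point belongs to
 * the engine, not to any particular job.
 */
struct engine_shared_syncpt {
	struct host1x_syncpt *sp;	/* allocated once, at engine probe time */
	u32 expected;			/* SW shadow of the expected counter value */
	spinlock_t lock;
};

/*
 * An externally visible increment of a job is expressed as a threshold
 * on the engine's shared sync point instead of on a per-job sync point,
 * so nothing needs to be allocated at job-submission time.
 */
static u32 engine_shared_syncpt_reserve(struct engine_shared_syncpt *shared,
					unsigned int num_incrs)
{
	u32 threshold;

	spin_lock(&shared->lock);
	shared->expected += num_incrs;
	threshold = shared->expected;
	spin_unlock(&shared->lock);

	return threshold;	/* external waiters wait for this value */
}

/*
 * On a job hang, only the shared counter needs to be caught up (by the
 * kernel or by the userspace driver); the job's internal sync point is
 * simply skipped over by moving DMAGET to the next job, like
 * grate-kernel already does.
 */
static void engine_shared_syncpt_recover(struct engine_shared_syncpt *shared)
{
	spin_lock(&shared->lock);
	while (host1x_syncpt_read(shared->sp) < shared->expected)
		host1x_syncpt_incr(shared->sp);
	spin_unlock(&shared->lock);
}

The point is only the ownership model: the sync point belongs to the
engine, jobs merely get thresholds handed out to them, and recovery
never has to touch a job's own sync point.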