Re: [RFC PATCH v2 10/17] WIP: gpu: host1x: Add no-recovery mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/13/20 9:37 PM, Dmitry Osipenko wrote:
13.09.2020 12:51, Mikko Perttunen пишет:
...
All waits that are internal to a job should only wait for relative sync
point increments. >
In the grate-kernel every job uses unique-and-clean sync point (which is
also internal to the kernel driver) and a relative wait [1] is used for
the job's internal sync point increments [2][3][4], and thus, kernel
driver simply jumps over a hung job by updating DMAGET to point at the
start of a next job.

Issues I have with this approach:

* Both this and my approach have the requirement for userspace, that if
a job hangs, the userspace must ensure all external waiters have timed
out / been stopped before the syncpoint can be freed, as if the
syncpoint gets reused before then, false waiter completions can happen.

So freeing the syncpoint must be exposed to userspace. The kernel cannot
do this since there may be waiters that the kernel is not aware of. My
proposal only has one syncpoint, which I feel makes this part simpler, too.

* I believe this proposal requires allocating a syncpoint for each
externally visible syncpoint increment that the job does. This can use
up quite a few syncpoints, and it makes syncpoints a dynamically
allocated resource with unbounded allocation latency. This is a problem
for safety-related systems.

Maybe we could have a special type of a "shared" sync point that is
allocated per-hardware engine? Then shared SP won't be a scarce resource
and job won't depend on it. The kernel or userspace driver may take care
of recovering the counter value of a shared SP when job hangs or do
whatever else is needed without affecting the job's sync point.

Having a shared syncpoint opens up possibilities for interference between jobs (if we're not using the firewall, the HW cannot distinguish between jobs on the same channel), and doesn't work if there are multiple channels using the same engine, which we want to do for newer chips (for performance and virtualization reasons).

Even then, even if we need to allocate one syncpoint per job, the issue seems to be there.


Primarily I'm not feeling very happy about retaining the job's sync
point recovery code because it was broken the last time I touched it and
grate-kernel works fine without it.

I'm not planning to retain it any longer than necessary, which is until the staging interface is removed. Technically I can already remove it now -- that would cause any users of the staging interface to potentially behave weirdly if a job times out, but maybe we don't care about that all that much?


* If a job fails on a "virtual channel" (userctx), I think it's a
reasonable expectation that further jobs on that "virtual channel" will
not execute, and I think implementing that model is simpler than doing
recovery.

Couldn't jobs just use explicit fencing? Then a second job won't be
executed if first job hangs and explicit dependency is expressed. I'm
not sure that concept of a "virtual channel" is applicable to drm-scheduler.

I assume what you mean is that each job incrementing a syncpoint would first wait for the preceding job incrementing that syncpoint to complete, by waiting for the preceding job's fence value.

I would consider what I do in this patch to be an optimization of that. Let's say we detect a timed out job and just skip that job in the CDMA pushbuffer (but do not CPU-increment syncpoints), then at every subsequent job using that syncpoint, we will be detecting a timeout and skipping it eventually. With the "NOPping" in this patch we just pre-emptively cancel those jobs so that we don't have to spend time waiting for timeouts in the future. Functionally these should be the same, though.

The wait-for-preceding-job-to-complete thing should already be there in form of the "serialize" operation if the jobs use the same syncpoint.

So, if DRM scheduler's current operation is just skipping the timing out job and continuing from the next job, that's functionally fine. But we could improve DRM scheduler to allow for also cancelling future jobs that we know will time out. That would be in essence "virtual channel" support.

Userspace still has options -- if it puts in other prefences, timeouts will happen as usual. If it wants to have multiple "threads" of execution where a timeout on one doesn't affect the others, it can use different syncpoints for them.


I'll need to see a full-featured driver implementation and the test
cases that cover all the problems that you're worried about because I'm
not aware about all the T124+ needs and seeing code should help. Maybe
in the end yours approach will be the best, but for now it's not clear :)


My primary goal is simplicity of programming model and implementation. Regarding the resource management concerns, I can of course create a test case that allocates a lot of resources, but what I'm afraid about is that once we put this into a big system, with several VMs with their own resource reservations (including syncpoints), and the GPU and camera subsystems using hundreds of syncpoints, dynamic usage of those resources will create uncertainty in the system, and bug reports.

And of course, if we want to make a safety-related system, you also need to document before-hand how you are ensuring that e.g. job submission (including syncpoint allocation if that is dynamic) happens under x microseconds.

I don't think the model used in the grate host1x driver is bad, and I think the code there and its integration with the existing kernel frameworks are beautiful, and that is definitely a goal for the mainline driver as well. But I think we can make things even simpler overall and more reliable.

Mikko



[Index of Archives]     [ARM Kernel]     [Linux ARM]     [Linux ARM MSM]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux