On 02/09/15 13:01, Konsta Hölttä wrote:
On 01/09/15 16:26, Ben Skeggs wrote:
On 31 August 2015 at 21:38, Konsta Hölttä <kholtta@xxxxxxxxxx> wrote:
Hi there,
Resending these now that they've had some more polish and testing,
and I heard
that Ben's vacation is over :-)
These patches work as a starting point for more explicit error
mechanisms and
better robustness. At the moment, when a job hangs or faults, it
seems that
nouveau doesn't quite know how to handle the situation and often
results in a
hang. Some of these situations would require either completely
resetting the
gpu, and/or a complex path for only recovering the broken channel.
Yes, this has been an issue that's been on my "really nice to
implement" list for quite a long time now, and only gotten around to
getting it partially done (as you can see). Thanks for looking at
this!
To start, I worked on support for letting userspace know what
exactly happened.
Proper recovery would come later. The "error notifier" in the first
patch is a
simple shared buffer between kernel and userspace. Its error codes
match
nvgpu's. Alternatively, the status could be queried with an ioctl,
but that
would be considerably more heavyweight. I'd like to know if the
event mechanism
is meant for these kinds of events at all (engines notify errors
upwards to the
drm layer). Another alternative would probably be to register the
same buffer
to all necessary engines separately in nvif method calls? Or
register it to
just one (e.g., fifo) and get that engine when errors happen in
others (e.g.,
gr)? And drm handles probably wouldn't fit there? Please comment on
this; I
wrote this before understanding the mthd mechanism.
For the moment, I've just got a couple of high-level
comments/questions on these patches so far before we worry about
nit-picking the details:
Is there any real need to have a full-blown notifier-style object to
pass this info back to userspace? The DRM has an event mechanism
where userspace could poll on its fd to get events, which is what I'd
assume you'd be doing inside of fence waits with the error notifier?
NVIF events are actually hooked up to this mechanism already, so
userspace could directly request them and cut quite a lot out of that
first patch. But, the NVKM-level passing of these events back over
NVIF is pretty much where I was going with the design too. So, good
to see you've done that :)
I hadn't heard about that event mechanism yet - do you mean just
drm_poll() and ->event_wait that is woken up in usif_notify()? That
doesn't have any values that could be passed to userland. Is there a
way for that too already? The motivation for the current approach is
that it's lightweight, not requiring any kernel calls (and is
compatible with nvidia gl drivers...)
Ah, to answer myself: that'd be drm_read(). Doh. IIUC, that applies to
nouveau's notify objects created from userspace? (route usif)
The poll and read calls would still be of more overhead than a plain
read in userspace, and that wouldn't be compatible with nvidia userspace
drivers. Might be doable with heavy refactoring and measuring if the
overhead is significant - I'm not familiar with that high-level part yet.
Konsta
I don't actually know all the details how higher-level parts of the
driver use the buffer :( just assuming things atm. I'd guess that the
buffer would be read after each submit (after a fence wait has
finished, which is a separate thing) if checking the status is
required. My understanding is that at least the desktop driver has (or
used to have) a shared buffer for lots of common stuff and this would
be part of it (hence the offset field), which would complicate the
whole driver stack if this were done differently. In Tegra's
libdrm-equivalent this is initialized on channel creation, so could be
in fact part of channel init args in nouveau?
I considered also that the buffer could be just one page and allocated
in the kernel, but as said, our current drivers want to be able to
pass a separate buffer there.
Additionally, priority and timeout management for separate channels
in flight
on the gpu is added in two patches. Neither is exactly what the name
says, but
the effect is the same, and this is what nvgpu does currently. Those
two
patches call the fifo channel object's methods directly from
userspace, so a
hack is added in the nvif path to accept that. The objects are NEW'd
from
kernel space, so calling from userspace isn't allowed, as it
appears. How
should this be managed in a clean way?
Is there any requirement to be able to adjust the priority of
in-flight channels? Or would it be satisfactory to have these as
arguments to the channel creation functions? Either way is fine, the
latter is preferred if there's no use case for adjusting it afterwards
though.
Maybe not exactly, but there is a technical requirement forced by
nvidia userspace driver that doesn't enforce this to be done at
creation time, but allows it at any time. I looked at that option too
when starting to write these patches and yes, that (specifying at init
time) would've simplified stuff.
It's certainly possible that the timeout/priority is set only later in
userspace after finding out that this is e.g. a webgl process
requiring to finish faster. That's one plausible usecase.
I'm attempting to get some better channel handling (you might have
noticed the current stuff the kernel uses is a bit fragile, it's
evolved over a long time and several generations of channel classes)
in place in time for 4.4, part of which will likely involve giving
userspace better control over the arguments that get passed at channel
creation time.
The current stuff is at least not trivial to grasp initially :) I'd
love to also help to write documentation for lots of things if time
permits - I've been taking some notes for myself, but a general nvkm
glossary and general reasoning about how things are done the way they
are would definitely help new folks to get to know the codebase
(wasn't easy for me). The problem is that I only could write about how
things are done, not why. Getting to know all that would require lots
of discussion (and time).
As for the permissions problem, that can be resolved fairly easily I
think, I'll play with a few ideas and see what I come up with.
Good to hear, thanks! A bit related question to this: what exactly
determines the level of control kernel vs userspace have for the
channels? Management rights for nouveau_bo's? Looks like some stuff of
nouveau_chan.c could live in userspace too.
Cheers,
Konsta
Thanks,
Ben.
Also, since nouveau often hangs on errors, the userspace hangs too
(waiting on
a fence). The final patch attempts to fix this in a couple of
specific error
paths to forcibly update all fences to be finished. I'd like to hear
how that
would be handled properly - consider the patch just a
proof-of-concept and
sample of what would be necessary.
I don't expect the patches to be accepted as-is - as a newbie, I'd
appreciate
any high-level comments on if I've understood anything, especially
the event
and nvif/method mechanisms (I use the latter from userspace with a hack
constructed from the perfmon branch seen here earlier into nvidia's
internal
libdrm-equivalent). The fence-forcing thing is something that is
necessary with
the error notifiers (at least with our userspace that waits really
long or
infinitely on fences). I'm working specifically on Tegra and don't
know much
about the desktop's userspace details, so I may be biased in some
areas.
I'd be happy to write sample tests on e.g. libdrm for the new
methods once the
kernel patches would get to a good shape, if that's required for
accepting new
features. I tested these to work as a proof-of-concept on Jetson
TK1, and the
code is adapted from the latest nvgpu.
The patches can also be found in http://github.com/sooda/nouveau and
are based
on a version of gnurou/staging.
Thanks!
Konsta (sooda in IRC)
Konsta Hölttä (5):
notify channel errors to userspace
don't verify route == owner in nvkm ioctl
gk104: channel priority/timeslice support
gk104: channel timeout detection
HACK force fences updated on error
drm/nouveau/include/nvif/class.h | 20 ++++
drm/nouveau/include/nvif/event.h | 12 +++
drm/nouveau/include/nvkm/engine/fifo.h | 5 +-
drm/nouveau/nouveau_chan.c | 95 +++++++++++++++++++
drm/nouveau/nouveau_chan.h | 10 ++
drm/nouveau/nouveau_drm.c | 1 +
drm/nouveau/nouveau_fence.c | 13 ++-
drm/nouveau/nouveau_gem.c | 29 ++++++
drm/nouveau/nouveau_gem.h | 2 +
drm/nouveau/nvkm/core/ioctl.c | 9 +-
drm/nouveau/nvkm/engine/fifo/base.c | 56 ++++++++++-
drm/nouveau/nvkm/engine/fifo/gf100.c | 2 +-
drm/nouveau/nvkm/engine/fifo/gk104.c | 166
++++++++++++++++++++++++++++++---
drm/nouveau/nvkm/engine/fifo/nv04.c | 2 +-
drm/nouveau/nvkm/engine/gr/gf100.c | 5 +
drm/nouveau/uapi/drm/nouveau_drm.h | 13 +++
16 files changed, 415 insertions(+), 25 deletions(-)
--
2.1.4
_______________________________________________
Nouveau mailing list
Nouveau@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/nouveau
--
To unsubscribe from this list: send the line "unsubscribe
linux-tegra" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
_______________________________________________
Nouveau mailing list
Nouveau@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/nouveau
--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html