Re: [PATCH] drm/etnaviv: Create an accel device node if compute-only

Lucas Stach <l.stach@xxxxxxxxxxxxxx> · Wed, 26 Jun 2024 10:28:55 +0200

Am Mittwoch, dem 26.06.2024 um 09:28 +0200 schrieb Daniel Vetter:
> On Mon, Jun 17, 2024 at 07:01:05PM +0200, Tomeu Vizoso wrote:
> > Hi Lucas,
> > 
> > Do you have any idea on how not to break userspace if we expose a render node?
> 
> So if you get a new chip with an incompatible 3d block, you already have
> that issue. And I hope etnaviv userspace can cope.
> 
> Worst case you need to publish a fake extremely_fancy_3d_block to make
> sure old mesa never binds against an NPU-only instance.
> 
> Or mesa just doesn't cope, in which case we need an etnaviv-v2-we_are_sorry
> drm driver name, or something like that.

Mesa doesn't cope right now. Mostly because of the renderonly thing
where we magically need to match render devices to otherwise render
incapable KMS devices. The way this matching works is that the
renderonly code tries to open a screen on a rendernode and if that
succeeds we treat it as the matching render device.

The core of the issue is that we have no way of specifying which kind
of screen we need at that point, i.e. if the screen should have 3D
render capabilities or if compute-only or even NN-accel-only would be
okay. So we can't fail screen creation if there is no 3D engine, as
this would break the teflon case, which needs a screen for the NN
accel, but once we successfully create a screen renderonly might treat
the thing as a rendering device.

So we are kind of stuck here between breaking one or the other use
case. I'm leaning heavily in the direction of just fixing Mesa, so we
can specify the type of screen we need at creation time to avoid the
renderonly issue, porting this change as far back as reasonably
possible and filing old userspace under shit-happens.
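
To illustrate the idea of passing the required screen type at creation
time, here is a minimal, self-contained sketch. Everything in it is an
assumption made for illustration: the capability flags, struct
fake_device, screen_create() and match_render_device() are hypothetical
names and are not existing Mesa or kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability flags; not taken from Mesa or the kernel. */
enum screen_caps {
    SCREEN_CAP_3D      = 1u << 0,
    SCREEN_CAP_COMPUTE = 1u << 1,
    SCREEN_CAP_NN      = 1u << 2,
};

/* Hypothetical description of what a probed device offers. */
struct fake_device {
    const char *name;
    unsigned caps;
};

/*
 * Screen creation succeeds only if the device provides every capability
 * the caller asked for. A teflon-style caller would pass SCREEN_CAP_NN,
 * a renderonly-style caller SCREEN_CAP_3D.
 */
static bool screen_create(const struct fake_device *dev, unsigned required)
{
    assert(dev != NULL);
    return (dev->caps & required) == required;
}

/*
 * renderonly-style matching: return the first device on which a
 * 3D-capable screen can be created, or NULL if there is none.
 */
static const struct fake_device *
match_render_device(const struct fake_device *devs, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (screen_create(&devs[i], SCREEN_CAP_3D))
            return &devs[i];
    return NULL;
}
```

With something along these lines, an NPU-only instance would still get
a screen when the caller asks for NN accel only, while the renderonly
matching would skip it because the 3D requirement is not met.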

Regards,
Lucas

> 
> > 
> > Cheers,
> > 
> > Tomeu
> > 
> > On Wed, Jun 12, 2024 at 4:26 PM Tomeu Vizoso <tomeu@xxxxxxxxxxxxxxx> wrote:
> > > 
> > > On Mon, May 20, 2024 at 1:19 PM Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > On Mon, 20 May 2024 at 08:39, Tomeu Vizoso <tomeu@xxxxxxxxxxxxxxx> wrote:
> > > > > On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@xxxxxxxxxxxxxx> wrote:
> > > > > > Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > > > > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > > > > userspace stack will offer it to compositors and applications for
> > > > > > > rendering, which of course won't work.
> > > > > > > 
> > > > > > > Userspace is probably right in not questioning whether a render node
> > > > > > > might not be capable of supporting rendering, so change it in the kernel
> > > > > > > instead by exposing a /dev/accel node.
> > > > > > > 
> > > > > > > Before we bring the device up we don't know whether it is capable of
> > > > > > > rendering or not (depends on the features of its blocks), so first try
> > > > > > > to probe a rendering node, and if we find out that there is no rendering
> > > > > > > hardware, abort and retry with an accel node.
> > > > > > 
> > > > > > On the other hand we already have precedent for compute-only DRM
> > > > > > devices exposing a render node: there are AMD GPUs that don't expose a
> > > > > > graphics queue and are thus not able to actually render graphics. Mesa
> > > > > > already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> > > > > > should simply extend this to not offer an EGL display on screens without
> > > > > > that capability.
> > > > > 
> > > > > The problem with this is that the compositors I know don't loop over
> > > > > /dev/dri files, trying to create EGL screens and moving to the next
> > > > > one until they find one that works.
> > > > > 
> > > > > They take the first render node (unless a specific one has been
> > > > > configured), and assume they will be able to render with it.
> > > > > 
> > > > > To me it seems as if userspace expects that /dev/dri/renderD* devices
> > > > > can be used for rendering and by breaking this assumption we would be
> > > > > breaking existing software.
> > > > 
> > > > Mm, it's sort of backwards from that. Compositors just take a
> > > > non-render DRM node for KMS, then ask GBM+EGL to instantiate a GPU
> > > > which can work with that. When run in headless mode, we don't take
> > > > render nodes directly, but instead just create an EGLDisplay or
> > > > VkPhysicalDevice and work backwards to a render node, rather than
> > > > selecting a render node and going from there.
> > > > 
> > > > So from that PoV I don't think it's really that harmful. The only
> > > > complication is in Mesa, where it would see an etnaviv/amdgpu/...
> > > > render node and potentially try to use it as a device. As long as Mesa
> > > > can correctly skip, there should be no userspace API implications.
> > > > 
> > > > That being said, I'm not entirely sure what the _benefit_ would be of
> > > > exposing a render node for a device which can't be used by any
> > > > 'traditional' DRM consumers, i.e. GL/Vulkan/winsys.
> > > 
> > > What I don't understand yet from Lucas' proposal is how this isn't
> > > going to break existing userspace.
> > > 
> > > I mean, even if we find a good way of having userspace skip
> > > non-rendering render nodes, what about existing userspace that isn't
> > > able to do that? Any updates to newer kernels are going to break them.
> > > 
> > > Regards,
> > > 
> > > Tomeu
>