Re: [PATCH 01/13] drm/amdgpu: introduce and honour DRM_FORCE_AUTH workaround

Emil Velikov <emil.l.velikov@xxxxxxxxx> · Wed, 29 May 2019 17:29:08 +0100

On 2019/05/29, Koenig, Christian wrote:
> Am 29.05.19 um 15:03 schrieb Emil Velikov:
> > On 2019/05/29, Dave Airlie wrote:
> >> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@xxxxxxxxx> wrote:
> >>> On 2019/05/28, Koenig, Christian wrote:
> >>>> Am 28.05.19 um 18:10 schrieb Emil Velikov:
> >>>>> On 2019/05/28, Daniel Vetter wrote:
> >>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian
> >>>>>> <Christian.Koenig@xxxxxxx> wrote:
> >>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter:
> >>>>>>>> [SNIP]
> >>>>>>>>> Might be a good idea looking into reverting it partially, so that at
> >>>>>>>>> least command submission and buffer allocation is still blocked.
> >>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every
> >>>>>>>> hacked up compositor under the sun getting this wrong one way or
> >>>>>>>> another. Thinking about this some more, I also have no idea how you'd
> >>>>>>>> want to deprecate rendering on primary nodes in general. Apparently
> >>>>>>>> that breaks -modesetting already, and probably lots more compositors.
> >>>>>>>> And it looks like we're finally achieve the goal kms set out to 10
> >>>>>>>> years ago, and new compositors are sprouting up all the time. I guess
> >>>>>>>> we could just break them all (on new hardware) and tell them to all
> >>>>>>>> suck it up. But I don't think that's a great option. And just
> >>>>>>>> deprecating this on amdgpu is going to be even harder, since then
> >>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks
> >>>>>>>> broken.
> >>>>>>>>
> >>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues
> >>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks,
> >>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the
> >>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for
> >>>>>>>> the various soc combinations out there), and this looks like a
> >>>>>>>> pragmatic solution. It'd be nice if every compositor and everything
> >>>>>>>> else would perfectly support multi gpu and only use render nodes for
> >>>>>>>> rendering, and only primary nodes for display. But reality is that
> >>>>>>>> people hack on stuff until gears on screen and then move on to more
> >>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/
> >>>>>>> Yeah, but this is a classic case of working around user space issues by
> >>>>>>> making kernel changes instead of fixing user space.
> >>>>>>>
> >>>>>>> Having privileged (output control) and unprivileged (rendering control)
> >>>>>>> functionality behind the same node is a mistake we have made a long time
> >>>>>>> ago and render nodes finally seemed to be a way to fix that.
> >>>>>>>
> >>>>>>> I mean why are compositors using the primary node in the first place?
> >>>>>>> Because they want to have access to privileged resources I think and in
> >>>>>>> this case it is perfectly ok to do so.
> >>>>>>>
> >>>>>>> Now extending unprivileged access to the primary node actually sounds
> >>>>>>> like a step into the wrong direction to me.
> >>>>>>>
> >>>>>>> I rather think that we should go down the route of completely dropping
> >>>>>>> command submission and buffer allocation through the primary node for
> >>>>>>> non master clients. And then as next step at some point drop support for
> >>>>>>> authentication/flink.
> >>>>>>>
> >>>>>>> I mean we have done this with UMS as well and I don't see much other way
> >>>>>>> to move forward and get rid of those ancient interface in the long term.
> >>>>>> Well kms had some really good benefits that drove quick adoption, like
> >>>>>> "suspend/resume actually has a chance of working" or "comes with
> >>>>>> buffer management so you can run multiple gears".
> >>>>>>
> >>>>>> The render node thing is a lot more niche use case (prime, better priv
> >>>>>> separation), plus "it's cleaner design". And the "cleaner design" part
> >>>>>> is something that empirically doesn't seem to matter :-/ Just two
> >>>>>> examples:
> >>>>>> - KHR_display/leases just iterated display resources on the fd needed
> >>>>>> for rendering (and iirc there was even a patch to expose that for
> >>>>>> render nodes too so it works with DRI3), because implementing
> >>>>>> protocols is too hard. Barely managed to stop that one before it
> >>>>>> happened.
> >>>>>> - Various video players use the vblank ioctl on directly to schedule
> >>>>>> frames, without telling the compositor. I discovered that when I
> >>>>>> wanted to limite the vblank ioctl to master clients only. Again,
> >>>>>> apparently too hard to use the existing extensions, or fix the bugs in
> >>>>>> there, or whatever. One userspace got fixed last year, but it'll
> >>>>>> probably get copypasted around forever :-/
> >>>>>>
> >>>>>> So I don't think we'll ever manage to roll a clean split out, and best
> >>>>>> we can do is give in and just hand userspace what it wants. As much as
> >>>>>> that's misguided and unclean and all that. Maybe it'll result in a
> >>>>>> least fewer stuff getting run as root to hack around this, because
> >>>>>> fixing properly seems not to be on the table.
> >>>>>>
> >>>>>> The beauty of kms is that we've achieved the mission, everyone's
> >>>>>> writing their own thing. Which is also terrible, and I don't think
> >>>>>> it'll get better.
> >>>>> With the risk of coming rude I will repeat my earlier comment:
> >>>>>
> >>>>> The problem is _neither_ Intel nor libva specific.
> >>>>>
> >>>>>
> >>>>>
> >>>>> That said, let's step back for a moment and consider:
> >>>>>
> >>>>>    - the "block everything but KMS via the primary node" idea is great but
> >>>>> orthogonal
> >>>>>
> >>>>>    - the series does address issues that are vendor-agnostic
> >>>>>
> >>>>>    - by default this series does _not_ cause any regression be that for
> >>>>> new or old userspace
> >>>>>
> >>>>>    - there are two trivial solutions, if the AMD team has concerns about
> >>>>> closed-source/private stack depending on the old behaviour
> >>>>> If they want I can even write the patches ;-)
> >>>>>
> >>>>>
> >>>>> That said, the notable comments received so far are:
> >>>>>    - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from
> >>>>> handle. I'm OK but this will change the return code - from EACCES to
> >>>>> ENOSYS
> >>>>>
> >>>>>    - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is
> >>>>> planning to drop nearly all DRM_AUTH instances in their driver.
> >>>>>
> >>>>>
> >>>>> Christian, as mentioned before - this series does _not_ add
> >>>>> functionality to render nodes. It effectively paves a way towards
> >>>>> removing DRM_AUTH.
> >>>> But it adds functionality to the primary node.
> >>>>
> >>> Behaviour is adjusted - functionality was there since day 1.
> >>>
> >>>>> I understand the series may feel a bit dirty. Yet I would gladly address
> >>>>> any technical concerns you have.
> >>>> Well putting compatibility issues aside my concern is that this is
> >>>> simply a bad design decision which we can't revert later on.
> >>>>
> >>> As sad above - any concerns (theoretical or actual regressions) can be
> >>> trivially fixed _without_ reverting any of this.
> >>>
> >>> I am more than happy to step up and address any regressions in timely
> >>> manner.
> >>>
> >>>
> >>> As a reminder without this series, some of your customers are forced to
> >>> run their applications as root.
> >> I'm torn here on whether this is worth it. Have we got more use cases
> >> to justify it?
> >>
> > Should have mentioned: three DRM drivers (not counting i915) have
> > dropped DRM_AUTH, assumingly for the same reasons I'm bringing here.
> >
> > Apart from the libva, kmscube + gst and mesa, I'm expecting other
> > projects to make the same mistake. Since the former three define the
> > norm of using DRM.
> >
> > The "fix" for all of these being "run as root" :-\
> >
> >> I'm wary of opening this up just because we can.
> >>
> > What can I do to alleviate that worry? I have spent over a week auditing
> > code and designed so that we can reinstate the authentication only where
> > needed.
> 
> Well I don't think the worry here is about regressions,
Glad to hear.

> but rather about 
> a design decision we will never be able to revert.
> 
Can you think of any reason/issue why we would want to revert this? I
will gladly spend some thing exploring how to address it.

> So the question we have to ask is rather if it's a good design decision 
> to resurrect the primary node with all its related compability burdens 
> to work around an issue which is essentially an userspace coding error.
> 

Can see you're not happy on the topic - I'm not too excited either. The
truth to the matter is - DRM drivers have dropped DRM_AUTH regardless of
my work.

It's very unfortunate, if AMDGPU stands out. Perhaps after some time and
unhappy users you'll reconsider.

I believe that Linus has pointed out a number of times that kernel
developers should care about our users. Even when it's an userspace
error.

HTH
Emil
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel