On 2019/05/29, Koenig, Christian wrote: > Am 29.05.19 um 15:03 schrieb Emil Velikov: > > On 2019/05/29, Dave Airlie wrote: > >> On Wed, 29 May 2019 at 02:47, Emil Velikov <emil.l.velikov@xxxxxxxxx> wrote: > >>> On 2019/05/28, Koenig, Christian wrote: > >>>> Am 28.05.19 um 18:10 schrieb Emil Velikov: > >>>>> On 2019/05/28, Daniel Vetter wrote: > >>>>>> On Tue, May 28, 2019 at 10:03 AM Koenig, Christian > >>>>>> <Christian.Koenig@xxxxxxx> wrote: > >>>>>>> Am 28.05.19 um 09:38 schrieb Daniel Vetter: > >>>>>>>> [SNIP] > >>>>>>>>> Might be a good idea looking into reverting it partially, so that at > >>>>>>>>> least command submission and buffer allocation is still blocked. > >>>>>>>> I thought the issue is a lot more than vainfo, it's pretty much every > >>>>>>>> hacked up compositor under the sun getting this wrong one way or > >>>>>>>> another. Thinking about this some more, I also have no idea how you'd > >>>>>>>> want to deprecate rendering on primary nodes in general. Apparently > >>>>>>>> that breaks -modesetting already, and probably lots more compositors. > >>>>>>>> And it looks like we're finally achieve the goal kms set out to 10 > >>>>>>>> years ago, and new compositors are sprouting up all the time. I guess > >>>>>>>> we could just break them all (on new hardware) and tell them to all > >>>>>>>> suck it up. But I don't think that's a great option. And just > >>>>>>>> deprecating this on amdgpu is going to be even harder, since then > >>>>>>>> everywhere else it'll keep working, and it's just amdgpu.ko that looks > >>>>>>>> broken. > >>>>>>>> > >>>>>>>> Aside: I'm not supporting Emil's idea here because it fixes any issues > >>>>>>>> Intel has - Intel doesn't care. I support it because reality sucks, > >>>>>>>> people get this render vs. primary vs. multi-gpu prime wrong all the > >>>>>>>> time (that's also why we have hardcoded display+gpu pairs in mesa for > >>>>>>>> the various soc combinations out there), and this looks like a > >>>>>>>> pragmatic solution. It'd be nice if every compositor and everything > >>>>>>>> else would perfectly support multi gpu and only use render nodes for > >>>>>>>> rendering, and only primary nodes for display. But reality is that > >>>>>>>> people hack on stuff until gears on screen and then move on to more > >>>>>>>> interesting things (to them). So I don't think we'll ever win this :-/ > >>>>>>> Yeah, but this is a classic case of working around user space issues by > >>>>>>> making kernel changes instead of fixing user space. > >>>>>>> > >>>>>>> Having privileged (output control) and unprivileged (rendering control) > >>>>>>> functionality behind the same node is a mistake we have made a long time > >>>>>>> ago and render nodes finally seemed to be a way to fix that. > >>>>>>> > >>>>>>> I mean why are compositors using the primary node in the first place? > >>>>>>> Because they want to have access to privileged resources I think and in > >>>>>>> this case it is perfectly ok to do so. > >>>>>>> > >>>>>>> Now extending unprivileged access to the primary node actually sounds > >>>>>>> like a step into the wrong direction to me. > >>>>>>> > >>>>>>> I rather think that we should go down the route of completely dropping > >>>>>>> command submission and buffer allocation through the primary node for > >>>>>>> non master clients. And then as next step at some point drop support for > >>>>>>> authentication/flink. > >>>>>>> > >>>>>>> I mean we have done this with UMS as well and I don't see much other way > >>>>>>> to move forward and get rid of those ancient interface in the long term. > >>>>>> Well kms had some really good benefits that drove quick adoption, like > >>>>>> "suspend/resume actually has a chance of working" or "comes with > >>>>>> buffer management so you can run multiple gears". > >>>>>> > >>>>>> The render node thing is a lot more niche use case (prime, better priv > >>>>>> separation), plus "it's cleaner design". And the "cleaner design" part > >>>>>> is something that empirically doesn't seem to matter :-/ Just two > >>>>>> examples: > >>>>>> - KHR_display/leases just iterated display resources on the fd needed > >>>>>> for rendering (and iirc there was even a patch to expose that for > >>>>>> render nodes too so it works with DRI3), because implementing > >>>>>> protocols is too hard. Barely managed to stop that one before it > >>>>>> happened. > >>>>>> - Various video players use the vblank ioctl on directly to schedule > >>>>>> frames, without telling the compositor. I discovered that when I > >>>>>> wanted to limite the vblank ioctl to master clients only. Again, > >>>>>> apparently too hard to use the existing extensions, or fix the bugs in > >>>>>> there, or whatever. One userspace got fixed last year, but it'll > >>>>>> probably get copypasted around forever :-/ > >>>>>> > >>>>>> So I don't think we'll ever manage to roll a clean split out, and best > >>>>>> we can do is give in and just hand userspace what it wants. As much as > >>>>>> that's misguided and unclean and all that. Maybe it'll result in a > >>>>>> least fewer stuff getting run as root to hack around this, because > >>>>>> fixing properly seems not to be on the table. > >>>>>> > >>>>>> The beauty of kms is that we've achieved the mission, everyone's > >>>>>> writing their own thing. Which is also terrible, and I don't think > >>>>>> it'll get better. > >>>>> With the risk of coming rude I will repeat my earlier comment: > >>>>> > >>>>> The problem is _neither_ Intel nor libva specific. > >>>>> > >>>>> > >>>>> > >>>>> That said, let's step back for a moment and consider: > >>>>> > >>>>> - the "block everything but KMS via the primary node" idea is great but > >>>>> orthogonal > >>>>> > >>>>> - the series does address issues that are vendor-agnostic > >>>>> > >>>>> - by default this series does _not_ cause any regression be that for > >>>>> new or old userspace > >>>>> > >>>>> - there are two trivial solutions, if the AMD team has concerns about > >>>>> closed-source/private stack depending on the old behaviour > >>>>> If they want I can even write the patches ;-) > >>>>> > >>>>> > >>>>> That said, the notable comments received so far are: > >>>>> - rework patch 13/13 to remove the DRM_AUTH from prime fd to/from > >>>>> handle. I'm OK but this will change the return code - from EACCES to > >>>>> ENOSYS > >>>>> > >>>>> - vmwgfx will need a check on the reference ioctl(s) - IIRC Thomas is > >>>>> planning to drop nearly all DRM_AUTH instances in their driver. > >>>>> > >>>>> > >>>>> Christian, as mentioned before - this series does _not_ add > >>>>> functionality to render nodes. It effectively paves a way towards > >>>>> removing DRM_AUTH. > >>>> But it adds functionality to the primary node. > >>>> > >>> Behaviour is adjusted - functionality was there since day 1. > >>> > >>>>> I understand the series may feel a bit dirty. Yet I would gladly address > >>>>> any technical concerns you have. > >>>> Well putting compatibility issues aside my concern is that this is > >>>> simply a bad design decision which we can't revert later on. > >>>> > >>> As sad above - any concerns (theoretical or actual regressions) can be > >>> trivially fixed _without_ reverting any of this. > >>> > >>> I am more than happy to step up and address any regressions in timely > >>> manner. > >>> > >>> > >>> As a reminder without this series, some of your customers are forced to > >>> run their applications as root. > >> I'm torn here on whether this is worth it. Have we got more use cases > >> to justify it? > >> > > Should have mentioned: three DRM drivers (not counting i915) have > > dropped DRM_AUTH, assumingly for the same reasons I'm bringing here. > > > > Apart from the libva, kmscube + gst and mesa, I'm expecting other > > projects to make the same mistake. Since the former three define the > > norm of using DRM. > > > > The "fix" for all of these being "run as root" :-\ > > > >> I'm wary of opening this up just because we can. > >> > > What can I do to alleviate that worry? I have spent over a week auditing > > code and designed so that we can reinstate the authentication only where > > needed. > > Well I don't think the worry here is about regressions, Glad to hear. > but rather about > a design decision we will never be able to revert. > Can you think of any reason/issue why we would want to revert this? I will gladly spend some thing exploring how to address it. > So the question we have to ask is rather if it's a good design decision > to resurrect the primary node with all its related compability burdens > to work around an issue which is essentially an userspace coding error. > Can see you're not happy on the topic - I'm not too excited either. The truth to the matter is - DRM drivers have dropped DRM_AUTH regardless of my work. It's very unfortunate, if AMDGPU stands out. Perhaps after some time and unhappy users you'll reconsider. I believe that Linus has pointed out a number of times that kernel developers should care about our users. Even when it's an userspace error. HTH Emil _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx