[RFC] Using DC in amdgpu for upcoming GPU

Alexander.Deucher@xxxxxxx (Deucher, Alexander) · Fri, 9 Dec 2016 17:32:24 +0000

> -----Original Message-----
> From: Dave Airlie [mailto:airlied at gmail.com]
> Sent: Thursday, December 08, 2016 3:07 PM
> To: Daniel Vetter
> Cc: Wentland, Harry; dri-devel; Grodzovsky, Andrey; amd-gfx mailing list;
> Deucher, Alexander; Cheng, Tony
> Subject: Re: [RFC] Using DC in amdgpu for upcoming GPU
> 
> > I can't dig into details of DC, so this is not a 100% assessment, but if
> > you call a function called "validate" in atomic_commit, you're very, very
> > likely breaking atomic. _All_ validation must happen in ->atomic_check,
> > if that's not the case TEST_ONLY mode is broken. And atomic userspace is
> > relying on that working.
> >
> > The only thing that you're allowed to return from ->atomic_commit is
> > out-of-memory, hw-on-fire and similar unforeseen and catastrophic issues.
> > Kerneldoc expklains this.
> >
> > Now the reason I bring this up (and we've discussed it at length in
> > private) is that DC still suffers from a massive abstraction midlayer. A
> > lot of the back-end stuff (dp aux, i2c, abstractions for allocation,
> > timers, irq, ...) have been cleaned up, but the midlayer is still there.
> > And I understand why you have it, and why it's there - without some OS
> > abstraction your grand plan of a unified driver across everything doesn't
> > work out so well.
> >
> > But in a way the backend stuff isn't such a big deal. It's annoying since
> > lots of code, and bugfixes have to be duplicated and all that, but it's
> > fairly easy to fix case-by-case, and as long as AMD folks stick around
> > (which I fully expect) not a maintainance issue. It makes it harder for
> > others to contribute, but then since it's mostly the leaf it's generally
> > easy to just improve the part you want to change (as an outsider). And if
> > you want to improve shared code the only downside is that you can't also
> > improve amd, but that's not so much a problem for non-amd folks ;-)
> >
> > The problem otoh with the abstraction layer between drm core and the
> amd
> > driver is that you can't ignore if you want to refactor shared code. And
> > because it's an entire world of its own, it's much harder to understand
> > what the driver is doing (without reading it all). Some examples of what I
> > mean:
> >
> > - All other drm drivers subclass drm objects (by embedding them) into the
> >   corresponding hw part that most closely matches the drm object's
> >   semantics. That means even when you have 0 clue about how a given
> piece
> >   of hw works, you have a reasonable chance of understanding code. If it's
> >   all your own stuff you always have to keep in minde the special amd
> >   naming conventions. That gets old real fast if you trying to figure out
> >   what 20+ (or are we at 30 already?) drivers are doing.
> >
> > - This is even more true for atomic. Atomic has a pretty complicated
> >   check/commmit transactional model for updating display state. It's a
> >   standardized interface, and it's extensible, and we want generic
> >   userspace to be able to run on any driver. Fairly often we realize that
> >   semantics of existing or newly proposed properties and state isn't
> >   well-defined enough, and then we need to go&read all the drivers and
> >   figure out how to fix up the mess. DC has it's entirely separate state
> >   structures which again don't subclass the atomic core structures (afaik
> >   at least). Again the same problems apply that you can't find things, and
> >   that figuring out the exact semantics and spotting differences in
> >   behaviour is almost impossible.
> >
> > - The trouble isn't just in reading code and understanding it correctly,
> >   it's also in finding it. If you have your own completely different world
> >   then just finding the right code is hard - cscope and grep fail to work.
> >
> > - Another issue is that very often we unify semantics in drivers by adding
> >   some new helpers that at least dtrt for most of the drivers. If you have
> >   your own world then the impendance mismatch will make sure that amd
> >   drivers will have slightly different semantics, and I think that's not
> >   good for the ecosystem and kms - people want to run a lot more than just
> >   a boot splash with generic kms userspace, stuff like xf86-video-$vendor
> >   is going out of favour heavily.
> >
> > Note that all this isn't about amd walking away and leaving an
> > unmaintainable mess behind. Like I've said I don't think this is a big
> > risk. The trouble is that having your own world makes it harder for
> > everyone else to understand the amd driver, and understanding all drivers
> > is very often step 1 in some big refactoring or feature addition effort.
> > Because starting to refactor without understanding the problem generally
> > doesn't work ;_) And you can't make this step 1 easier for others by
> > promising to always maintain DC and update it to all the core changes,
> > because that's only step 2.
> >
> > In all the DC discussions we've had thus far I haven't seen anyone address
> > this issue. And this isn't just an issue in drm, it's pretty much
> > established across all linux subsystems with the "no midlayer or OS
> > abstraction layers in drivers" rule. There's some real solid reasons why
> > such a HAl is extremely unpopular with upstream. And I haven't yet seen
> > any good reason why amd needs to be different, thus far it looks like a
> > textbook case, and there's been lots of vendors in lots of subsystems who
> > tried to push their HAL.
> 
> Daniel has said this all very nicely, I'm going to try and be a bit more direct,
> because apparently I've possibly been too subtle up until now.
> 
> No HALs. We don't do HALs in the kernel. We might do midlayers sometimes
> we try not to do midlayers. In the DRM we don't do either unless the
> maintainers
> are asleep. They might be worth the effort for AMD, however for the Linux
> kernel
> they don't provide a benefit and make maintaining the code a lot harder. I've
> maintained this code base for over 10 years now and I'd like to think
> I've only merged
>  something for semi-political reasons once (initial exynos was still
> more Linuxy than DC),
> and that thing took a lot of time to cleanup, I really don't feel like
> saying yes again.
> 
> Given the choice between maintaining Linus' trust that I won't merge
> 100,000 lines
> of abstracted HAL code and merging 100,000 lines of abstracted HAL code
> I'll give you one guess where my loyalties lie. The reason the
> toplevel maintainer (me)
> doesn't work for Intel or AMD or any vendors, is that I can say NO
> when your maintainers
> can't or won't say it.
> 
> I've only got one true power as a maintainer, and that is to say No.
> The other option
> is I personally sit down and rewrite all the code in an acceptable
> manner, and merge that
> instead. But I've discovered I probably don't scale to that level, so
> again it leaves me
> with just the one actual power.
> 
> AMD can't threaten not to support new GPUs in upstream kernels without
> merging this,
> that is totally something you can do, and here's the thing Linux will
> survive, we'll piss off
> a bunch of people, but the Linux kernel will just keep on rolling
> forward, maybe at some
> point someone will get pissed about lacking upstream support for your
> HW and go write
> support and submit it, maybe they won't. The kernel is bigger than any
> of us and has
> standards about what is acceptable. Read up on the whole mac80211
> problems we had
> years ago, where every wireless vendor wrote their own 80211 layer
> inside their driver,
> there was a lot of time spent creating a central 80211 before any of
> those drivers were
> suitable for merge, well we've spent our time creating a central
> modesetting infrastructure,
> bypassing it is taking a driver in totally the wrong direction.
> 
> I've also wondered if the DC code is ready for being part of the
> kernel anyways, what
> happens if I merge this, and some external contributor rewrites 50% of
> it and removes a
> bunch of stuff that the kernel doesn't need. By any kernel standards
> I'll merge that sort
> of change over your heads if Alex doesn't, it might mean you have to
> rewrite a chunk
> of your internal validation code, or some other interactions, but
> those won't be reasons
> to block the changes from my POV. I'd like some serious introspection
> on your team's
> part on how you got into this situation and how even if I was feeling
> like merging this
> (which I'm not) how you'd actually deal with being part of the Linux
> kernel and not hiding
> in nicely framed orgchart silo behind a HAL. I honestly don't think
> the code is Linux worthy
> code, and I also really dislike having to spend my Friday morning
> being negative about it,
> but hey at least I can have a shower now.
> 
> No.

Hi Dave,

I think this is part of the reason a lot of people get fed up with working upstream in Linux.  I can respect your technical points and if you kept it to that, I'd be fine with it and we could have a technical discussion starting there.  But attacking us or our corporate culture is not cool.  I think perhaps you have been in the RH silo for too long.  Our corporate culture is not like RH's.  Like it or not, we have historically been a windows centric company.  We have a few small Linux team that has been engaged with the community for a long time, but the rest of the company has not.  We are working to improve it, but we can only do so many things at one time.  GPU cycles are fast.  There's only so much time in the day; we'd like to make our code perfect, but we also want to get it out to customers while the hw is still relevant.  We are finally at a point where our AMD Linux drivers are almost feature complete compared to windows and we have support upstream well before hw launch and we get shit on for trying to do the right thing.  It doesn't exactly make us want to continue contributing.  That's the problem with Linux.  Unless you are part time hacker who is part of the "in" crowd can spend all of his days tinkering with making the code perfect, a vendor with massive resources who can just through more people at it, or a throw it over the wall and forget it vendor (hey, my code can just live in staging), there's no room for you.

You love to tell the exynos story about how crappy the code was and then after it was cleaned up how glorious it was. Except the vendor didn't do that.  Another vendor paid another vendor to do it.  We don't happen to have the resources to pay someone else to do that for us.  Moreover, doing so would negate all of the advantages to bringing up the code along with the hw team in the lab when the asics come back from the fab.  Additionally, the original argument against the exynos code was that it was just thrown over the wall and largely ignored by the vendor once it was upstream.  We've been consistently involved in upstream (heck, I've been at AMD almost 10 years now maintaining our drivers).  You talk about trust.  I think there's something to cutting a trusted partner some slack as they work to further improve their support vs. taking a hard line because you got burned once by a throw it over the wall vendor who was not engaged.  Even if you want to take a hard line, let's discuss it on technical merits, not mud-slinging.

I realize you care about code quality and style, but do you care about stable functionality?  Would you really merge a bunch of huge cleanups that would potentially break tons of stuff in subtle ways because coding style is that important?  I'm done with that myself.  I've merged too many half-baked cleanups and new features in the past and ended up spending way more time fixing them than I would have otherwise for relatively little gain.  The hw is just too complicated these days.  At some point people what support for the hw they have and they want it to work.  If code trumps all, then why do we have staging?  

I understand forward progress on APIs, but frankly from my perspective, atomic has been a disaster for stability of both atomic and pre-atomic code.  Every kernel cycle manages to break several drivers.  What happened to figuring out how to do in right in a couple of drivers and then moving that to the core.  We seem to have lost that in favor of starting in the core first.  I feel like we constantly refactor the core to deal with that or that quirk or requirement of someone's hardware and then deal with tons of fallout.  Is all we care about android?  I constantly hear the argument, if we don't do all of this android will do their own thing and then that will be the end.  Right now we are all suffering and android barely even using this yet.  If Linux will carry on without AMD contributing maybe Linux will carry on ok without bending over backwards for android.  Are you basically telling us that you'd rather we water down our driver and limit the features and capabilities and stability we can support so that others can refactor our code constantly for hazy goals to support some supposed glorious future that never seems to come?  What about right now?  Maybe we could try and support some features right now.  Maybe we'll finally see Linux on the desktop.

Alex