> I can't dig into details of DC, so this is not a 100% assessment, but if > you call a function called "validate" in atomic_commit, you're very, very > likely breaking atomic. _All_ validation must happen in ->atomic_check, > if that's not the case TEST_ONLY mode is broken. And atomic userspace is > relying on that working. > > The only thing that you're allowed to return from ->atomic_commit is > out-of-memory, hw-on-fire and similar unforeseen and catastrophic issues. > Kerneldoc expklains this. > > Now the reason I bring this up (and we've discussed it at length in > private) is that DC still suffers from a massive abstraction midlayer. A > lot of the back-end stuff (dp aux, i2c, abstractions for allocation, > timers, irq, ...) have been cleaned up, but the midlayer is still there. > And I understand why you have it, and why it's there - without some OS > abstraction your grand plan of a unified driver across everything doesn't > work out so well. > > But in a way the backend stuff isn't such a big deal. It's annoying since > lots of code, and bugfixes have to be duplicated and all that, but it's > fairly easy to fix case-by-case, and as long as AMD folks stick around > (which I fully expect) not a maintainance issue. It makes it harder for > others to contribute, but then since it's mostly the leaf it's generally > easy to just improve the part you want to change (as an outsider). And if > you want to improve shared code the only downside is that you can't also > improve amd, but that's not so much a problem for non-amd folks ;-) > > The problem otoh with the abstraction layer between drm core and the amd > driver is that you can't ignore if you want to refactor shared code. And > because it's an entire world of its own, it's much harder to understand > what the driver is doing (without reading it all). Some examples of what I > mean: > > - All other drm drivers subclass drm objects (by embedding them) into the > corresponding hw part that most closely matches the drm object's > semantics. That means even when you have 0 clue about how a given piece > of hw works, you have a reasonable chance of understanding code. If it's > all your own stuff you always have to keep in minde the special amd > naming conventions. That gets old real fast if you trying to figure out > what 20+ (or are we at 30 already?) drivers are doing. > > - This is even more true for atomic. Atomic has a pretty complicated > check/commmit transactional model for updating display state. It's a > standardized interface, and it's extensible, and we want generic > userspace to be able to run on any driver. Fairly often we realize that > semantics of existing or newly proposed properties and state isn't > well-defined enough, and then we need to go&read all the drivers and > figure out how to fix up the mess. DC has it's entirely separate state > structures which again don't subclass the atomic core structures (afaik > at least). Again the same problems apply that you can't find things, and > that figuring out the exact semantics and spotting differences in > behaviour is almost impossible. > > - The trouble isn't just in reading code and understanding it correctly, > it's also in finding it. If you have your own completely different world > then just finding the right code is hard - cscope and grep fail to work. > > - Another issue is that very often we unify semantics in drivers by adding > some new helpers that at least dtrt for most of the drivers. If you have > your own world then the impendance mismatch will make sure that amd > drivers will have slightly different semantics, and I think that's not > good for the ecosystem and kms - people want to run a lot more than just > a boot splash with generic kms userspace, stuff like xf86-video-$vendor > is going out of favour heavily. > > Note that all this isn't about amd walking away and leaving an > unmaintainable mess behind. Like I've said I don't think this is a big > risk. The trouble is that having your own world makes it harder for > everyone else to understand the amd driver, and understanding all drivers > is very often step 1 in some big refactoring or feature addition effort. > Because starting to refactor without understanding the problem generally > doesn't work ;_) And you can't make this step 1 easier for others by > promising to always maintain DC and update it to all the core changes, > because that's only step 2. > > In all the DC discussions we've had thus far I haven't seen anyone address > this issue. And this isn't just an issue in drm, it's pretty much > established across all linux subsystems with the "no midlayer or OS > abstraction layers in drivers" rule. There's some real solid reasons why > such a HAl is extremely unpopular with upstream. And I haven't yet seen > any good reason why amd needs to be different, thus far it looks like a > textbook case, and there's been lots of vendors in lots of subsystems who > tried to push their HAL. Daniel has said this all very nicely, I'm going to try and be a bit more direct, because apparently I've possibly been too subtle up until now. No HALs. We don't do HALs in the kernel. We might do midlayers sometimes we try not to do midlayers. In the DRM we don't do either unless the maintainers are asleep. They might be worth the effort for AMD, however for the Linux kernel they don't provide a benefit and make maintaining the code a lot harder. I've maintained this code base for over 10 years now and I'd like to think I've only merged something for semi-political reasons once (initial exynos was still more Linuxy than DC), and that thing took a lot of time to cleanup, I really don't feel like saying yes again. Given the choice between maintaining Linus' trust that I won't merge 100,000 lines of abstracted HAL code and merging 100,000 lines of abstracted HAL code I'll give you one guess where my loyalties lie. The reason the toplevel maintainer (me) doesn't work for Intel or AMD or any vendors, is that I can say NO when your maintainers can't or won't say it. I've only got one true power as a maintainer, and that is to say No. The other option is I personally sit down and rewrite all the code in an acceptable manner, and merge that instead. But I've discovered I probably don't scale to that level, so again it leaves me with just the one actual power. AMD can't threaten not to support new GPUs in upstream kernels without merging this, that is totally something you can do, and here's the thing Linux will survive, we'll piss off a bunch of people, but the Linux kernel will just keep on rolling forward, maybe at some point someone will get pissed about lacking upstream support for your HW and go write support and submit it, maybe they won't. The kernel is bigger than any of us and has standards about what is acceptable. Read up on the whole mac80211 problems we had years ago, where every wireless vendor wrote their own 80211 layer inside their driver, there was a lot of time spent creating a central 80211 before any of those drivers were suitable for merge, well we've spent our time creating a central modesetting infrastructure, bypassing it is taking a driver in totally the wrong direction. I've also wondered if the DC code is ready for being part of the kernel anyways, what happens if I merge this, and some external contributor rewrites 50% of it and removes a bunch of stuff that the kernel doesn't need. By any kernel standards I'll merge that sort of change over your heads if Alex doesn't, it might mean you have to rewrite a chunk of your internal validation code, or some other interactions, but those won't be reasons to block the changes from my POV. I'd like some serious introspection on your team's part on how you got into this situation and how even if I was feeling like merging this (which I'm not) how you'd actually deal with being part of the Linux kernel and not hiding in nicely framed orgchart silo behind a HAL. I honestly don't think the code is Linux worthy code, and I also really dislike having to spend my Friday morning being negative about it, but hey at least I can have a shower now. No. Dave.