Re: [RFC] Using DC in amdgpu for upcoming GPU

Hi Dave,

Apologies for waking you up with the RFC on a Friday morning. I'll try to time big stuff better next time.

A couple of thoughts below after having some discussions internally. I think Tony might add to some of them or provide his own.

On 2016-12-11 09:57 PM, Dave Airlie wrote:
> On 8 December 2016 at 12:02, Harry Wentland <harry.wentland@xxxxxxx> wrote:
>> We propose to use the Display Core (DC) driver for display support on
>> AMD's upcoming GPU (referred to as uGPU in the rest of the doc). In order
>> to avoid a flag day, the plan is to only support uGPU initially and
>> transition to older ASICs gradually.

> [FAQ: from past few days]
>
> 1) Hey, you replied to Daniel, you never addressed the points of the RFC!
> I've read it being said that I hadn't addressed the RFC, and you know,
> I've realised I actually had, because the RFC is great, but it
> presupposes the codebase as designed can get upstream eventually, and
> I don't think it can. The code is so littered with midlayering and
> other problems that actually addressing the individual points of the
> RFC would miss the main point I'm trying to make.
>
> This code needs rewriting, not cleaning, not polishing; it needs to be
> split into its constituent parts and reintegrated in a form more
> friendly to the Linux process.
>
> I feel that if I replied to the individual points Harry has raised in
> this RFC, it would imply the code was then suitable for merging, which
> it still isn't, and I don't want people wasting another six months.
>
> If DC were ready for the next-gen GPU, it would be ready for the
> current GPU; it's not the specific ASIC code that is the problem, it's
> the huge midlayer sitting in the middle.

> 2) We really need to share all of this code between OSes, why does
> Linux not want it?

> Sharing code is a laudable goal, and I appreciate the resourcing
> constraints that led us to the point at which we find ourselves, but
> the way forward involves finding resources to upstream this code:
> dedicated people (even one person) who can spend time on a day-by-day
> basis talking to people in the open and working upstream, improving
> other pieces of the drm as they go, reading atomic patches and
> reviewing them, and who can incrementally build the DC experience on
> top of the Linux kernel infrastructure, with the corresponding changes
> in the DC codebase then made internally to match how the kernel code
> ends up looking. Lots of this code overlaps with stuff the drm already
> does, and lots of it is stuff the drm should be doing, so patches to
> the drm should be sent instead.


Personally, I'm with you on this and hope to get us there. I'm learning... we're learning. I agree that changes to atomic, removal of abstractions, etc. should happen on dri-devel.

When it comes to brand-new technologies (MST, FreeSync), though, we're often the first, which means we spend a considerable amount of time getting things right, working with HW teams, receiver vendors and other partners internal and external to AMD. By the time we do get it right, it's time to hit the market. This gives us fairly little leeway to work with the community on patches that won't land in distros for another half a year. We're definitely hoping to improve some of this, but it's not easy and in some cases impossible ahead of time (though definitely possible after initial release).


> 3) Then how do we upstream it?
> Resource(s) need(s) to start concentrating on splitting this thing up
> and using portions of it in the upstream kernel. We don't land fully
> formed code in the kernel if we can avoid it, because you can't review
> the ideas and structure as easily as when someone builds up code in
> chunks and actually develops in the Linux kernel. This has always
> produced better, more maintainable code. Maybe the result will end up
> improving the AMD codebase as well.

> 4) Why can't we put this in staging?
> People have also mentioned staging; Daniel has called it a dead end.
> I'd have considered staging for this code base, and I still might.
> However, staging has rules, and the main one is that code in staging
> needs a TODO list and agreed criteria for exiting staging. I don't
> think we'd be able to get agreement on what the TODO list should
> contain and how we'd ever get everything on it done. If this code
> ended up in staging, it would most likely require someone dedicated
> to recreating it in the mainline driver in an incremental fashion,
> and I don't see that resource being available.


I don't think we really want staging. If it helps us get into DRM, sure, but if it's more of a pain, as suggested, then probably not.

> 5) Why is a midlayer bad?
> I'm not going to go into specifics on the DC midlayer, but we abhor
> midlayers for a fair few reasons. The one I find causes the most
> issues is locking. When you have breaks in code flow between multiple
> layers, with layers calling back into previous layers, it becomes
> nearly impossible to track who owns the locking and what the current
> locking state is.


There's a conscious design decision to have absolutely no locking in DC, and this is one of the reasons. Locking is really OS-dependent behavior and has no place in DC.
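
As a rough illustration of the split we're aiming for (all names here are hypothetical, not the actual amdgpu/DC interfaces), every lock lives in the OS-specific DM layer, and the DC core is only ever called with whatever locking the OS has already arranged:

#include <linux/mutex.h>

struct dc;                       /* opaque core object (hypothetical) */
struct dc_mode;                  /* hypothetical mode description */

struct amdgpu_dm {
        struct dc *dc;
        struct mutex mode_lock;  /* OS-side lock, invisible to DC */
};

/* Core entry point: pure hardware programming, takes no locks. */
void dc_set_mode(struct dc *dc, const struct dc_mode *mode);

/* Linux DM wrapper: all locking stays on the OS side. */
static void amdgpu_dm_set_mode(struct amdgpu_dm *dm,
                               const struct dc_mode *mode)
{
        mutex_lock(&dm->mode_lock);
        dc_set_mode(dm->dc, mode);
        mutex_unlock(&dm->mode_lock);
}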

> Consider
>     drma -> dca -> dcb -> drmb
>     drmc -> dcc -> dcb -> drmb
>
> We have two code paths that go back into drmb; now maybe drma has a
> lock taken, but drmc doesn't, and we've no indication when we hit drmb
> of what the context was before entering the DC layer. This causes all
> kinds of problems. The main requirement is that the driver maintains
> the execution flow as much as possible. The only callback behaviour
> should be from irq or workqueue type situations, where you've handed
> execution flow to the hardware to do something and it is getting back
> to you. The pattern we use to get out of this sort of hole is helper
> libraries: we structure code as much as possible as leaf nodes that
> don't call back into the parents if we can avoid it (we don't always
> succeed).
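
To make the hazard concrete, here's a minimal sketch of the two paths above (all names hypothetical, dca/dcc elided): once re-entered through DC, drmb has no way of knowing whether the lock is already held:

#include <linux/mutex.h>

static DEFINE_MUTEX(obj_lock);

static void drmb(void)
{
        /* Is obj_lock already held by our caller? From here there is
         * no way to tell: locking again deadlocks on path 1, not
         * locking races on path 2. */
        mutex_lock(&obj_lock);
        /* ... touch shared state ... */
        mutex_unlock(&obj_lock);
}

static void dcb(void)
{
        drmb();                  /* DC calls back up into drm */
}

static void drma(void)
{
        mutex_lock(&obj_lock);   /* path 1 enters with the lock... */
        dcb();                   /* ...and self-deadlocks in drmb */
        mutex_unlock(&obj_lock);
}

static void drmc(void)
{
        dcb();                   /* path 2 enters without the lock */
}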


Is that the reason for using ww_mutex in atomic?
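
For context (my reading, not Dave's words): the ww_mutex-based drm_modeset_lock machinery keeps lock state explicit in an acquire context that is threaded through every call, and would-be deadlocks surface as -EDEADLK with a well-defined backoff, roughly:

#include <drm/drm_crtc.h>
#include <drm/drm_modeset_lock.h>

/* Typical atomic-style locking: everything is acquired under one
 * acquire context; on contention we drop all locks and retry instead
 * of deadlocking. */
static int touch_crtc(struct drm_crtc *crtc)
{
        struct drm_modeset_acquire_ctx ctx;
        int ret;

        drm_modeset_acquire_init(&ctx, 0);
retry:
        ret = drm_modeset_lock(&crtc->mutex, &ctx);
        if (ret == -EDEADLK) {
                drm_modeset_backoff(&ctx);
                goto retry;
        }

        /* ... modify CRTC state while the lock is held ... */

        drm_modeset_drop_locks(&ctx);
        drm_modeset_acquire_fini(&ctx);
        return ret;
}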

> So the above might become
>    drma -> dca_helper
>            -> dcb_helper
>            -> drmb.
>
> In this case the code flow is controlled by drma; dca/dcb might be
> modifying data or setting hw state, but when we get to drmb it's easy
> to see what data it needs and what locking applies.


This actually looks pretty close to

	drm_atomic_commit
	-> amdgpu_dm_atomic_commit
	-> dc_commit_targets
	-> dce110_apply_ctx_to_hw
	-> apply_single_controller_ctx_to_hw
	-> core_link_enable_stream
	-> allocate_mst_payload
	-> dm_helpers_dp_mst_write_payload_allocation_table
		-> drm_dp_update_payload_part1

though the latter is a bit more complex.
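
The dm_helpers_* step is where DC crosses back into OS services; on Linux that ends up as a thin shim over the shared DRM MST helper. A simplified sketch (the real DC prototype takes more context; this is approximated from the call chain above):

#include <drm/drm_dp_mst_helper.h>

/* Linux implementation of the DC helper: DC itself stays OS-agnostic,
 * and this shim forwards the request to the DRM MST code, which walks
 * the current MST state and writes the payload allocation table to
 * the branch device. */
bool dm_helpers_dp_mst_write_payload_allocation_table(
                struct drm_dp_mst_topology_mgr *mgr)
{
        return drm_dp_update_payload_part1(mgr) == 0;
}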

> DAL/DC goes against this in so many ways, and when I look at the code
> I'm never sure where to even start pulling the thread to unravel it.


There's a lot of code there, but that doesn't mean it's needlessly complex. We're definitely open to suggestions on how to simplify this, ideally without breaking existing functionality.

> Some questions I have for AMD engineers that I'd also want to see
> addressed before any consideration of merging would happen!
>
> How do you plan on dealing with people rewriting or removing code
> upstream that is redundant in the kernel, but required for internal
> stuff?

There's already a bunch of stuff in our internal trees that never makes it into open-source trees, for various reasons. We guard those pieces with an #ifdef and strip them when preparing code for open source. It shouldn't be a big deal to deal with code removed upstream in a similar way.
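
For illustration, the pattern looks roughly like this (the guard macro and function names are hypothetical):

struct dc;                            /* opaque core object */

#if defined(DC_INTERNAL_ONLY)         /* hypothetical guard macro */
void dc_program_unreleased_feature(struct dc *dc);
#endif

static void dc_enable_output(struct dc *dc)
{
        /* shared, published programming sequence */
        /* ... */

#if defined(DC_INTERNAL_ONLY)
        /* compiled only in internal trees; stripped, unifdef-style,
         * before an open-source drop */
        dc_program_unreleased_feature(dc);
#endif
}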

Rewritten code would have to be looked at on a case-by-case basis. DC code is fully validated in many different configurations and is used for ASIC bring-up, when we can sit next to the HW guys to work out complex issues. Modifying the code in a way that can't be shared would mean all of that validation is lost. Some of the bugs we're talking about are non-trivial and show up only if HW is programmed in a certain way (e.g. Linux code leaves out some power-saving feature, causing HW to hang in weird scenarios).

> How are you going to deal with new Linux things that overlap
> incompatibly with your internally developed stuff?

Do you have examples?

If we're talking about stuff like MST, atomic, FreeSync, HDR... we're generally the first to the game and would love to be working with the community to push those out.

> If the code is upstream, will it be tested in the kernel by some QA
> group, or will there be some CI infrastructure used to maintain it
> and to watch for Linux code that breaks assumptions in the DC code?

I think Alex is working on getting our internal tree onto a rolling tip of drm-next (or is nearly there). Once we have this, we'll switch our existing builds and manual (not yet automated) testing over to it. We're currently building daily, and on each DC commit, and testing a basic feature matrix at least every second day.

> Can you show me you understand that upstream code is no longer 100% in
> your control and things can happen to it that you might not expect and
> you need to deal with it?


I think this is the big question. I would love to let other AMDers chime in on this.

Harry

> Dave.

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



