(adding danvet to here, as I'd imagine they might be interested in seeing some of this) Thank you for the descriptive write up. I think we can solve some of the problems you described here, however the patches that you submitted definitely won't work as-is. In patch 2, by reverting Intel back to using only 7 retries you technically break whatever monitor originally prompted us to bump the retry count up to 32 in the first place. And we have to try our hardest to avoid breaking people's displays when they were already working. Also, I'll definitely point out though that a few of the issues you've pointed out actually exist as workarounds for bad DisplayPort devices (which is a big reason we have these helpers), but I think we might be able to fix some of the issues you've mentioned by coming up with better workarounds. More details below On Thu, 2021-02-11 at 06:56 +0000, Almahallawy, Khaled wrote: > On Wed, 2021-02-10 at 13:03 -0500, Lyude Paul wrote: > > On Wed, 2021-02-10 at 00:33 -0800, Khaled Almahallawy wrote: > > > The number of AUX retries specified in the DP specs is 7. > > > Currently, to make > > > Dell 4k monitors happier, the number of retries are 32. > > > i915 also retries 5 times (intel_dp_aux_xfer) which means in the > > > case of AUX > > > timeout we actually retries 32 * 5 = 160 times. > > > > Is there any good reason for i915 to actually be doing retries > > itself? It seems > > like an easier solution here might just to be to fix i915 so it > > doesn't retry, > > and just rely on DRM to do the retries as appropriate. > > > > That being said though, I'm open to this if there is a legitimate > > reason for > > i915 to be handling retries on its own > > i915 or others may benefit from controlling AUX retries to optimize and > minimize timing spent on the aux transactions. > > drm_dpcd_access handles multiple aux retries cases the same way (retry > 32 times at worst case). The 4 cases are: > 1- *AUX receive or IO error*. I believe it is up to specific > implementation/HW once it detects a receive error to retry based on > their internal understanding using the timeout appropriate to the HW > capabilities. This makes sense so far > > 2- *AUX Timeout* As the driver follows the specs for the recommended > timeout timer as defined in section (2.11.2 AUX Transaction > Response/Reply Timeouts), the driver has more intelligence to know how > many retries it needs. This is more useful in case of DP-ALT if some > issues happen in Type-C stack that cause AUX timeout (e.g. USB3 dock > detected as high speed (USB2) causing SBU/AUX lines to be disabled). > Also the disconnect of MST hub/docks is dependent how fast a userspace > disconnect all MST connectors. Not all user spaces do it as fast like > Chrome (ubuntu is an example): > https://chromium-review.googlesource.com/c/chromium/src/+/2512550/ ; So - I'm not exactly following how this portion of the specification is relevant to the issue that you're bringing up here. If I understand section 2.11.2 correctly, a "response" here is defined as a transmission from the DPRX which informs us on the current state of the transaction that we've started. This would be any one of the aux response statuses you've mentioned in this email: AUX_NACK, AUX_ACK, or AUX_DEFER. It doesn't actually have anything to do with the number of retries we have to do, because (ignoring the fact we retry on AUX_NACKS in DRM for a moment here) that reply could be an AUX_DEFER which would indicate that we have to resend the transaction and try again. I think this is the correct understanding of section 2.11.2, because I definitely don't think real world DP devices are able to actually complete a full AUX transaction within a timespan of 300-400µs reliably for the most part. The second thing to mention here is that regardless of the first point, I don't think your point about MST displays needing to be removed by userspace is correct. It is true that userspace can actually hold onto references to removed MST connectors after they've been removed, but the main purpose of this being possible is in order to accommodate legacy clients that wouldn't be able to handle a connector disappearing under them cleanly. Once the MST topology which a connector corresponds to is removed, the MST connector is removed ASAP from the kernel's cache of the respective DisplayPort's MST topology along with all of the connectors below it. At the same time, the respective DRM connectors are also unregistered from userspace. DRM explicitly doesn't allow any kind of client to enable new displays on connectors that are unregistered, and will reject any modesets involving them with the exception of modesets which only disable displays on those unregistered connectors (this last part is mainly just to be nice to legacy clients). So it doesn't really matter how quickly userspace disables the display configuration on those connectors, from the kernel's perspective they're already gone and a new MST topology can be connected to the system. The expectation is that even if userspace doesn't turn those connectors off, the only possible move is to move the CRTCs on them to new connectors or disable them outright. Anyway-if there -is- actually a problem caused by userspace not disabling displays on those connectors, that's definitely not intentional and not how things are supposed to work. But this part of the MST helpers in DRM is pretty solid at this point, and most times when people point out this oddity with MST it ends up being a bug on userspace's end. Feel free to prove me wrong though! > > 3- *AUX_NACK* DP spec mentions retry 3 times for NACK(2.16.1.3 GTC Lock > Acquisition). However, sometimes we really don’t need any retry for > NACK, if DPRX replied AUX_NACK for DPCD that it doesn’t support (e.g. > reading LTTPR Capability and ID Field at DPCD F0000h ~ F0007). Ah, yes, -this-. Originally we did actually just abort any transaction on AUX_NACK, which is honestly the correct thing that we should be doing. But it looks like I actually changed this at one point after finding some DP devices that would send AUX_NACK instead of AUX_DEFER when they weren't yet ready to receive messages. Luckily it seems I wrote up a pretty long explanation around this when I changed this behavior back in: 82922da39190 ("drm/dp_helper: Retry aux transactions on all errors") This was almost five years ago though when I was quite new to working on DRM, so reading through this commit I already think I have some much better ideas for how we can handle issues with DisplayPort devices like this. For starters, this isn't the first workaround regarding a DisplayPort device or it's connected source device waking up. There's also: f808f63372cc ("drm/dp_helper: Perform throw-away read before actual read in drm_dp_dpcd_read()") Which was originally added as a workaround in the Intel driver, and then got moved into the DRM helpers by me. The important thing about these two workarounds that sticks out to me is that they're both issues with DP sinks/hubs that only happen when either the source is first connected to a sink or hub, or when the sink/hub is waking up from a low power state like D3. So it seems like in both of these instances, after we manage one "successful" transaction (where we define "successful" as both an AUX_ACK, -and- the monitor giving us a sensible reply instead of random garbage) then things start to become normal and match up with the spec. The commonality between these two workarounds makes me think that we could solve the AUX_NACK problem here (-and- this junky throwaway read) if we just kept track of whether or not we've managed a "successful" transaction at least once, after which point we can just immediately abort on any AUX_NACK we receive like we used to. Which would solve the issue you're mentioning here with our handling of AUX_NACKS, without breaking the monitors that actually need those workarounds. First, we would we add a field to drm_dp_aux called "ready". Then, we would want AUX transactions to go like this: 1. Whenever either of the following events occur: 1a. A new DP device being connected to the system 1b. Bringing an already-connected DP device out of a low power state through DPCD register 00600h or some other mechanism We set the "ready" field to False 2. Then, when the driver attempts an AUX channel transaction. We start to attempt a single DPCD register read from 00000h and retry until that transaction has completed with AUX_ACK. Take note that we will retry this transaction a total of 32 times like usual, but we will keep retrying even in the face of an AUX_NACK. 3. If the aforementioned transaction never completes with AUX_ACK, we consider the driver's DPCD transaction to have failed and return the appropriate return code. However-if the transaction does complete with AUX_ACK, we set the "ready" field to true. 4. We then attempt the original AUX transaction that the driver requested. 5. For the transaction in 4, and any subsequent transactions, as long as "ready" is set to True we go with the same 32 retries on AUX_DEFERs, but abort the transaction immediately on an AUX_NACK. > 4- *AUX_DEFER* The specs stated we may retry 7 times on AUX_DEFER > (3.5.1.2.3 LANEx_CHANNEL_EQ_DONE Sequence) and may terminate LT. Also > with the increased usage of USB4, TBT/Type-C Docks/Displays, and active > cables, the use of LTTPR becomes common which adds more challenge > especially if we have multiple LTTPRs and all operate in non-LTTPR > mode. In this case all LTTPRs will reply AUX_DEFER in 300us if it did > not receive any aux response from other LTTPRs and DPRX. The SCR states > we need to retry 7 times and not more than 50ms (DP v2.0 SCR on 8b/10b > DPTX and LTTPR Requirements Update to Section 3.6) I'm not sure where you're getting 7 retries and not more then 50ms from. I currently have a copy of the DP v2.0 standard, but I'm not sure what SCR is. Is this some sort of update to Section 3.6 from the DP v2.0 standard? Because I see some mentions of 50ms response times in my copy of the 2.0 standard regarding LTTPR, but they don't at all look related to what you're talking about here. (If it is some update I don't have access to, I'll poke the X.org VESA contacts and see if they can get me access to this). Regardless though, I'm still not sure I understand the issue here. If we're retrying a transaction, it's because the transaction didn't succeed - which in turn means that something went wrong on the DPRX's end. In the event of something going wrong with the DPRX for long enough that we end up exceeding that 50ms timeout, wouldn't that already mean that the link training process is already failing and needs to be aborted? Or are you saying that we would receive AUX_NACKs in such an event, which could cause us to keep retrying the transaction for longer then 50ms, resulting in the DPRX ending the link training prematurely? If it's the former, hopefully the solution I suggested for your third point would end up fixing this anyway (but I'm always open to discussion if that solution isn't enough). > > In addition I believe this function is not correct in treating > AUX_DEFER and AUX_NACK as -EIO. Especially for AUX_DEFER, it is a valid > 1 byte response that can provide a valuable debugging information > especially in the case of on-board LTTPR where there is no way to > capture this AUX response between the SoC and LTTPR unless you solder > two wires on AUX_P and AUX_N lines and use i2c/aux analyzer to decode. > At least we should provide the same debug information as we do in > drm_dp_i2c_do_msg by printing AUX_DEFER and AUX_NACK. > > Thank you for your feedback and review. Is the debugging output for DPCD transactions (e.g. setting bit 0x100 for drm.debug on the kernel commandline) not sufficient enough for this kind of debugging? I'm fine with us being more specific with our return codes in that case, I just wonder if they would conflict with any of the error codes some of the DRM DP helpers already return. Anyway-let me know if there's anything in my responses I need to clarify, or anything I missed here. Also, sorry for the very long response! There was a lot of context I had to dump here for this to make sense. > --Khaled > > > > > So making the number of aux retires a variable to allow for fine > > > tuning and > > > optimization of aux timing. > > > > > > Signed-off-by: Khaled Almahallawy <khaled.almahallawy@xxxxxxxxx> > > > --- > > > drivers/gpu/drm/drm_dp_helper.c | 10 +++------- > > > include/drm/drm_dp_helper.h | 1 + > > > 2 files changed, 4 insertions(+), 7 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/drm_dp_helper.c > > > b/drivers/gpu/drm/drm_dp_helper.c > > > index eedbb48815b7..8fdf57b4a06c 100644 > > > --- a/drivers/gpu/drm/drm_dp_helper.c > > > +++ b/drivers/gpu/drm/drm_dp_helper.c > > > @@ -249,13 +249,7 @@ static int drm_dp_dpcd_access(struct > > > drm_dp_aux *aux, u8 > > > request, > > > > > > mutex_lock(&aux->hw_mutex); > > > > > > - /* > > > - * The specification doesn't give any recommendation on how > > > often to > > > - * retry native transactions. We used to retry 7 times like > > > for > > > - * aux i2c transactions but real world devices this wasn't > > > - * sufficient, bump to 32 which makes Dell 4k monitors > > > happier. > > > - */ > > > - for (retry = 0; retry < 32; retry++) { > > > + for (retry = 0; retry < aux->num_retries; retry++) { > > > if (ret != 0 && ret != -ETIMEDOUT) { > > > usleep_range(AUX_RETRY_INTERVAL, > > > AUX_RETRY_INTERVAL + 100); > > > @@ -1744,6 +1738,8 @@ void drm_dp_aux_init(struct drm_dp_aux *aux) > > > aux->ddc.retries = 3; > > > > > > aux->ddc.lock_ops = &drm_dp_i2c_lock_ops; > > > + /*Still making the Dell 4k monitors happier*/ > > > + aux->num_retries = 32; > > > } > > > EXPORT_SYMBOL(drm_dp_aux_init); > > > > > > diff --git a/include/drm/drm_dp_helper.h > > > b/include/drm/drm_dp_helper.h > > > index edffd1dcca3e..16cbfc8f5e66 100644 > > > --- a/include/drm/drm_dp_helper.h > > > +++ b/include/drm/drm_dp_helper.h > > > @@ -1876,6 +1876,7 @@ struct drm_dp_aux { > > > struct mutex hw_mutex; > > > struct work_struct crc_work; > > > u8 crc_count; > > > + int num_retries; > > > ssize_t (*transfer)(struct drm_dp_aux *aux, > > > struct drm_dp_aux_msg *msg); > > > /** -- Sincerely, Lyude Paul (she/her) Software Engineer at Red Hat Note: I deal with a lot of emails and have a lot of bugs on my plate. If you've asked me a question, are waiting for a review/merge on a patch, etc. and I haven't responded in a while, please feel free to send me another email to check on my status. I don't bite! _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx