On Mon, Jan 30, 2023 at 09:48:31AM +0100, Andi Shyti wrote: > Hi GG, > > thanks for the deep analysis! > > > Hi Andi, > > In the call flow invoked by intel_pcode_init(), I've added brief comments > > where further clarification is needed in this scenario, and a description of > > the suspicious scenario at the bottom. > > > > ------------------------------------------------------------------------- > > intel_pcode_init() > > | > > +-> skl_pcode_request(uncore, DG1_PCODE_STATUS, > > DG1_UNCORE_GET_INIT_STATUS, > > DG1_UNCORE_INIT_STATUS_COMPLETE, > > DG1_UNCORE_INIT_STATUS_COMPLETE, 180000); > > | > > +-> skl_pcode_try_request() > > | > > +-> *status = __snb_pcode_rw(uncore, mbox, &request, NULL, > > 500, 0, true); > > > > ------------------------------------------------------------------------- > > static int __snb_pcode_rw(struct intel_uncore *uncore, u32 mbox, > > u32 *val, u32 *val1, > > int fast_timeout_us, int slow_timeout_ms, > > bool is_read) > > { > > ... > > /* Before writing a value to the GEN6_PCODE_DATA register, > > check if the bit in the GEN6_PCODE_MAILBOX register indicates > > BUSY. */ > > if (intel_uncore_read_fw(uncore, GEN6_PCODE_MAILBOX) & GEN6_PCODE_READY) > > return -EAGAIN; > > what if we fail here because the punit is still initializing and > will be ready, say, in 10 seconds? > > GG, without going any further, we fail here! The -EAGAIN we > receive from the test comes from this point. We don't fail with > -ETIMEDOUT, but with -EAGAIN and the reason is because the punit > is not ready to perform the very fist communication and we fail > the probing. > > It doesn't mean, though, that there is anything wrong, we just > need to wait a bit before "taking drastic decisions"! > > This patch is suggesting to wait up to 10s for the punit to be > ready and eventually try to probe again... and, indeed, it works! As GG, what I still don't understand is how this extra 10 seconds wait helps... have you tried to simple add the 10 to the 180 and make the code 190 sec instead? > > Andi > > > > > /* write value to GEN6_PCODE_DATA register */ > > intel_uncore_write_fw(uncore, GEN6_PCODE_DATA, *val); > > > > intel_uncore_write_fw(uncore, GEN6_PCODE_DATA1, val1 ? *val1 : 0); > > > > /* In this scenario, the value > > "DG1_PCODE_STATUS | GEN6_PCODE_READY" > > is written to the GEN6_PCODE_MAILBOX register, > > so that the Busy status of the GEN6_PCODE_MAILBOX register > > can be checked later. > > (When the value of the GEN6_PCODE_READY bit of the > > GEN6_PCODE_MAILBOX register changes to 0, the operation can > > be considered completed.) */ > > intel_uncore_write_fw(uncore, > > GEN6_PCODE_MAILBOX, GEN6_PCODE_READY | mbox); > > > > /* In this scenario, verify that the BUSY status bit in the > > GEN6_PCODE_MAILBOX register turns off for up to 500us. */ > > if (__intel_wait_for_register_fw(uncore, > > GEN6_PCODE_MAILBOX, > > GEN6_PCODE_READY, 0, > > fast_timeout_us, > > slow_timeout_ms, > > &mbox)) > > return -ETIMEDOUT; > > /* If there is a failure here, it may be considered that the > > "DG1_PCODE_STATUS | GEN6_PCODE_READY" operation was not > > completed within 500us */ > > ... > > } > > > > int skl_pcode_request(struct intel_uncore *uncore, u32 mbox, u32 request, > > u32 reply_mask, u32 reply, int timeout_base_ms) > > { > > u32 status; > > int ret; > > > > mutex_lock(&uncore->i915->sb_lock); > > > > #define COND \ > > skl_pcode_try_request(uncore, mbox, request, reply_mask, reply, &status) > > > > /* the first trial for skl_pcode_try_request() can return > > -EAGAIN or -ETIMEDOUT. And the code did not check the error > > code here, so we don't know how far the __snb_pcode_rw() > > function went. It is not known whether the pcode_mailbox > > status was busy before writing the value to the > > GEN6_PCODE_DATA register or after.*/ > > if (COND) { > > ret = 0; > > goto out; > > } > > > > /* In this scenario, skl_pcode_try_request() is invoked every > > 10us for 180 seconds. When skl_pcode_try_request() returns > > -EAGAIN and -ETIMEDOUT by _wait_for(), > > -ETIMEDOUT is returned to a variable ret. */ > > > > ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10); > > > > if (!ret) > > goto out; > > > > /* > > * The above can time out if the number of requests was low (2 in the > > * worst case) _and_ PCODE was busy for some reason even after a > > * (queued) request and @timeout_base_ms delay. As a workaround retry > > * the poll with preemption disabled to maximize the number of > > * requests. Increase the timeout from @timeout_base_ms to 50ms to > > * account for interrupts that could reduce the number of these > > * requests, and for any quirks of the PCODE firmware that delays > > * the request completion. > > */ > > drm_dbg_kms(&uncore->i915->drm, > > "PCODE timeout, retrying with preemption disabled\n"); > > drm_WARN_ON_ONCE(&uncore->i915->drm, timeout_base_ms > 3); > > preempt_disable(); > > ret = wait_for_atomic(COND, 50); > > > > preempt_enable(); > > > > out: > > mutex_unlock(&uncore->i915->sb_lock); > > return status ? status : ret; > > #undef COND > > } > > > > --------------------------------------------------------- > > > > If you try skl_pcode_try_request() for 180 seconds in skl_pcode_request(), > > and the first "intel_uncore_read_fw(uncore, GEN6_PCODE_MAILBOX) & > > GEN6_PCODE_READY)" call in __snb_pcode_rw() that skl_pcode_try_request() > > invokes always fails. if then it does not make sense to me why this patch > > fixes it by just waiting 10 seconds.This is because if it was called with > > the flow described above, 180 seconds is longer than 10 seconds, so the > > scenario you mentioned is also covered in the existing code. > > > > To describe in more detail the second scenario I previously commented on: > > skl_pcode_request() tries skl_pcode_try_request() for 180 seconds > > 1) In skl_pcode_try_request(), the GEN6_PCODE_MAILBOX register bit is not > > BUSY, so the value is written to the GEN6_PCODE_DATA1 register. > > 2) skl_pcode_try_request() returns -ETIMEDOUT value because the operation > > of 1) does not complete within 500us. > > 3) Scenario in which the GEN6_PCODE_MAILBOX register bit is checked as BUSY > > and returns -EAGAIN in the last call of skl_pcode_try_request() invoked by > > skl_pcode_request() > > > > If the reason why this problem occurred is because of this scenario, > > shouldn't there be an attempt to increase fast_timeout_us used as an > > argument of __snb_pcode_rw() to 500us or more when skl_pcode_try_request() > > returns -ETIMEDOUT? > > > > Br, > > G.G.