On 1/11/2024 11:25, Kim, Jonathan wrote: >> This looks OK. The compiler must be warning about a potential problem >> here, not a definite one. >> >> Question for Jon, how does the firmware encode the error code in the >> context ID? I see these macros: >> >> #define KFD_DEBUG_CP_BAD_OP_ECODE_MASK 0x3fffc00 >> #define KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT 10 >> #define KFD_DEBUG_CP_BAD_OP_ECODE(ctxid0) (((ctxid0) & \ >> KFD_DEBUG_CP_BAD_OP_ECODE_MASK) \ >> >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) >> >> It looks like we have 16 bits for the ECODE. That's enough to have a bit >> mask. Do we really need KFD_EC_MASK to convert an error number into a >> bitmask here? > > Added Jay for confirmation. > I could be wrong but IIRC (and I'm quite fuzzy on this ... probably should document this), unlike the wave trap code interrupt mask (bit mask) the CP bad op code is a single error code that directly points to one of the exception code enums that we defined in the user API header. > If that's the case, the KFD_EC_MASK is convenient for the kfd debugger code to mask the payload to send to the debugger or runtime. > If that's been wrong this whole time (i.e. the bad ops code is actually a bitwise mask of ecodes), then I'm not sure how we were able to get away with running the runtime negative tests for as long as we have and we'd need to recheck those tests. That's right. Queue errors are serialized. The error code is recorded directly. Wavefront errors may occur concurrently within a wavefront. Those are recorded as a bitmask. >> In the code above, if ecode is 0, that would lead to calling >> kfd_set_dbg_ev_from_interrupt with a event mask of 0. Not sure if that >> even makes sense. Jon, so we need special handling of cases where the >> error code is 0 or out of range, so we can warn about buggy firmware >> rather than creating nonsensical events for the debugger? > > That makes sense. Again, deferring to Jay if a NULL cp bad op code is expected under any circumstances. > Either way, raising undefined events to the debugger or runtime isn't useful so range checking to filter out non-encoded cp bad op interrupts would be needed. On AQL queues this interrupt carries an error code beginning from 16.