On 2024-01-16 14:44, Joshua Ashton wrote:
> On 1/16/24 13:41, Joshua Ashton wrote:
>> On 1/16/24 12:24, Joshua Ashton wrote:
>>> On 1/16/24 07:47, Christian König wrote:
>>>> Am 16.01.24 um 01:05 schrieb Marek Olšák:
>>>>> On Mon, Jan 15, 2024 at 3:06 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>>>>>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>>>>>> On 1/15/24 19:19, Christian König wrote:
>>>>>>>> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
>>>>>>>>> On 1/15/24 18:53, Christian König wrote:
>>>>>>>>>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>>>>>>>>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
>>>>>>>>>>>> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock <friedrich.vock@xxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Re-sending as plaintext, sorry about that
>>>>>>>>>>>>
>>>>>>>>>>>> On 15.01.24 18:54, Michel Dänzer wrote:
>>>>>>>>>>>> > On 2024-01-15 18:26, Friedrich Vock wrote:
>>>>>>>>>>>> >> [snip]
>>>>>>>>>>>> >> The fundamental problem here is that not telling applications
>>>>>>>>>>>> >> that something went wrong when you just canceled their work
>>>>>>>>>>>> >> midway is an out-of-spec hack.
>>>>>>>>>>>> >> When there is a report of real-world apps breaking because of
>>>>>>>>>>>> >> that hack, reports of different apps working (even if it's
>>>>>>>>>>>> >> convenient that they work) doesn't justify keeping the broken
>>>>>>>>>>>> >> code.
>>>>>>>>>>>> > If the breaking apps hit multiple soft resets in a row, I've
>>>>>>>>>>>> > laid out a pragmatic solution which covers both cases.
>>>>>>>>>>>>
>>>>>>>>>>>> Hitting soft reset every time is the lucky path. Once GPU work is
>>>>>>>>>>>> interrupted out of nowhere, all bets are off and it might as well
>>>>>>>>>>>> trigger a full system hang next time. No hang recovery should be
>>>>>>>>>>>> able to cause that under any circumstance.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the more insidious situation is no further hangs but wrong
>>>>>>>>>>>> results because we skipped some work. That we skipped work may
>>>>>>>>>>>> e.g. result in some texture not being uploaded or some GPGPU work
>>>>>>>>>>>> not being done, causing further errors downstream (say, if a game
>>>>>>>>>>>> is doing AI/physics on the GPU, not to say anything of actual
>>>>>>>>>>>> GPGPU work one might be doing, like AI).
>>>>>>>>>>> Even worse if this is compute on e.g. OpenCL for something
>>>>>>>>>>> science/math/whatever related, or training a model.
>>>>>>>>>>>
>>>>>>>>>>> You could randomly just get invalid/wrong results without even
>>>>>>>>>>> knowing!
>>>>>>>>>> Well, on the kernel side we do provide an API to query the result
>>>>>>>>>> of a submission. That includes canceling submissions with a soft
>>>>>>>>>> recovery.
>>>>>>>>>>
>>>>>>>>>> What we just don't do is prevent further submissions from this
>>>>>>>>>> application, e.g. enforcing that the application is punished for
>>>>>>>>>> bad behavior.
>>>>>>>>> You do prevent future submissions for regular resets though: those
>>>>>>>>> increase karma, which sets ctx->guilty, and if ctx->guilty then
>>>>>>>>> -ECANCELED is returned for a submission.
>>>>>>>>>
>>>>>>>>> ctx->guilty is never true for soft recovery though, as it doesn't
>>>>>>>>> increase karma, which is the problem this patch is trying to solve.
>>>>>>>>>
>>>>>>>>> By the submission result query API, I assume you mean checking the
>>>>>>>>> submission fence error somehow? That doesn't seem very ergonomic for
>>>>>>>>> a Vulkan driver compared to the simple solution, which is to just
>>>>>>>>> mark it as guilty with what already exists...
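As an aside, the existing rejection path Joshua describes above (karma from a hard reset marks the context guilty, and further submissions are then refused with -ECANCELED) boils down to roughly the check sketched below. This is a paraphrased illustration, not the exact amdgpu kernel source, and the helper name is made up for the sketch:

    /* Paraphrased sketch of the behaviour described above: a hard reset
     * increases scheduler karma, which marks the context guilty, and
     * later submissions from that context are rejected. Soft recovery
     * never sets the guilty flag today, which is the gap this thread
     * is about. */
    static int amdgpu_cs_reject_if_guilty_sketch(struct amdgpu_ctx *ctx)
    {
            if (atomic_read(&ctx->guilty))  /* set after a hard recovery */
                    return -ECANCELED;      /* reject this submission */

            return 0;
    }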
>>>>>>>> Well, as I said, the guilty handling is broken for quite a number of
>>>>>>>> reasons.
>>>>>>>>
>>>>>>>> What we can do rather trivially is changing this code in
>>>>>>>> amdgpu_job_prepare_job():
>>>>>>>>
>>>>>>>>         /* Ignore soft recovered fences here */
>>>>>>>>         r = drm_sched_entity_error(s_entity);
>>>>>>>>         if (r && r != -ENODATA)
>>>>>>>>                 goto error;
>>>>>>>>
>>>>>>>> This will bubble up errors from soft recoveries into the entity as
>>>>>>>> well and makes sure that further submissions are rejected.
>>>>>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>>>>>> not tell the app that it was guilty.
>>>>>> No, it clearly gets that signaled. We should probably replace the
>>>>>> guilty atomic with calls to drm_sched_entity_error().
>>>>>>
>>>>>> It's just that this isn't what Marek and I had in mind for this:
>>>>>> basically, completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>>>>>> AMDGPU_CTX_OP_QUERY_STATE2.
>>>>>>
>>>>>> Instead, just look at the return value of the CS or query fence result
>>>>>> IOCTL.
>>>>>>
>>>>>> When you get an -ENODATA you have been guilty of causing a soft
>>>>>> recovery; when you get an -ETIME you are guilty of causing a timeout
>>>>>> which had to be hard recovered. When you get an -ECANCELED you are an
>>>>>> innocent victim of a hard recovery somebody else caused.
>>>>>>
>>>>>> What we haven't defined yet is an error code for losing VRAM, but that
>>>>>> should be trivial to do.
>>>>> So far we have implemented the GPU reset and soft reset, but we
>>>>> haven't done anything to have a robust system recovery. Under the
>>>>> current system, things can easily keep hanging indefinitely because
>>>>> nothing prevents that.
>>>>>
>>>>> The reset status query should stay. Robust apps will use it to tell
>>>>> when they should recreate their context and resources even if they
>>>>> don't submit anything. Let's fully trust robust apps here. In the
>>>>> future we might change our mind about that, but for now, let's just
>>>>> focus on API conformance, and later we can change it as long as we
>>>>> stay API conformant.
>>>>>
>>>>> Non-robust apps must be terminated when they hang or are innocent but
>>>>> affected. Their existence is a security and usability problem and a
>>>>> source of frustration for users. 100% guaranteed system recovery is
>>>>> impossible if they continue to live.
>>>>>
>>>>> IBs should be rejected for all guilty and affected innocent contexts
>>>>> unconditionally, both robust and non-robust ones, by the kernel.
>>>>> Userspace only forwards the reset status to apps for robust contexts
>>>>> and doesn't do anything else, but userspace may decide to terminate
>>>>> the process if any non-robust context is affected.
>>>>
>>>> Yeah, that absolutely works for me.
>>>>
>>>> Going to adjust the implementation accordingly.
>>>
>>> Awesome, please CC me when you have something.
>>>
>>> In the short term I have changed if (r && r != -ENODATA) to if (r) and
>>> that seems to work nicely for me.
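Concretely, the short-term change mentioned above stops filtering out the soft-recovery error in amdgpu_job_prepare_job(), so a soft recovery also poisons the entity and further submissions are rejected. Roughly this diff, reconstructed from the snippet quoted earlier rather than taken from an actual posted patch:

    -        /* Ignore soft recovered fences here */
    -        r = drm_sched_entity_error(s_entity);
    -        if (r && r != -ENODATA)
    -                goto error;
    +        /* Treat soft-recovery errors (-ENODATA) like any other entity
    +         * error so that further submissions are rejected as well */
    +        r = drm_sched_entity_error(s_entity);
    +        if (r)
    +                goto error;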
>>
>> One problem with solely relying on the CS submission return value from
>> userspace is cancelled syncobj waits.
>>
>> For example, if we have an application with one thread that makes a
>> submission, and then kicks off a vkWaitSemaphores to wait on a semaphore
>> on another thread, and that submission hangs, the syncobj relating to
>> the vkWaitSemaphores should be signalled, which is fine, but we need to
>> return VK_ERROR_DEVICE_LOST if the context loss resulted in the signal
>> for the VkSemaphore.
>>
>> The way this was previously integrated was with the query thing, which,
>> as we have established, does not provide correct information regarding
>> soft recovery at the moment.
>>
>> Unless you have an alternative for us to get some error out of the
>> syncobj (e.g. -ENODATA), then right now we still require the query.
>>
>> I think fixing the -ENODATA reporting back for submit is a good step,
>> but I believe we still need the query to report the same information as
>> we would have gotten from that here.
>
> Hmmm, actually the spec states that VK_SUCCESS is valid in this
> situation:
>
> Commands that wait indefinitely for device execution (namely
> vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum
> timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT bit
> set in flags) must return in finite time even in the case of a lost
> device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.

That could be read as "Return VK_SUCCESS on success, VK_ERROR_DEVICE_LOST
on device lost", couldn't it?


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer
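To make the syncobj-wait concern above concrete: as long as soft recoveries are also reflected in the reset-status query, a Vulkan driver can keep doing roughly the following after a semaphore wait completes. This is a minimal userspace sketch; amdgpu_cs_query_reset_state2() and AMDGPU_CTX_QUERY2_FLAGS_RESET are the existing libdrm query, while wait_for_syncobj() and the surrounding helper are illustrative stand-ins, not actual RADV code:

    #include <errno.h>
    #include <stdint.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>
    #include <vulkan/vulkan_core.h>

    /* Illustrative stand-in for however the driver waits on the syncobj
     * backing a VkSemaphore; returns 0 on signal, -ETIME on timeout. */
    extern int wait_for_syncobj(uint32_t syncobj_handle, uint64_t timeout_ns);

    /* Sketch: the wait can complete "successfully" even when the kernel
     * cancelled the work, so afterwards check whether the context was
     * reset and report VK_ERROR_DEVICE_LOST to the application. */
    static VkResult wait_semaphore_sketch(amdgpu_context_handle ctx,
                                          uint32_t syncobj_handle,
                                          uint64_t timeout_ns)
    {
            uint64_t flags = 0;
            int r = wait_for_syncobj(syncobj_handle, timeout_ns);

            if (r == -ETIME)
                    return VK_TIMEOUT;
            if (r)
                    return VK_ERROR_DEVICE_LOST;

            /* Reset-status query; soft recoveries need to show up here
             * for the scheme discussed in this thread to work. */
            if (amdgpu_cs_query_reset_state2(ctx, &flags))
                    return VK_ERROR_DEVICE_LOST;

            if (flags & AMDGPU_CTX_QUERY2_FLAGS_RESET)
                    return VK_ERROR_DEVICE_LOST;

            return VK_SUCCESS;
    }

Whether returning VK_SUCCESS instead would also be spec-conformant in this situation is exactly the question Michel raises above.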