On 11.10.2017 11:02, Christian König wrote:
>> 1. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
>> their fence status to "*ECANCELED*"
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>> [ML] without deep thought and experiment, I'm not sure of the difference
>> between them, but kicking it out in the gpu_reset routine is more efficient,
>>
> I really don't think so. Kicking them out during gpu_reset sounds racy
> to me once more.
>
> And marking them canceled when we try to run them has the clear
> advantage that all dependencies are met first.

This makes sense to me as well.

It raises a vaguely related question: What happens to jobs whose
dependencies were canceled? I believe we currently don't check those
errors, so we might execute them anyway if their contexts were
unaffected by the reset.

There's a risk that the job will hang due to stale data. I don't think
it's a huge risk in practice today because we don't have a lot of buffer
sharing between applications, but it's something to think through at
some point. In a way, canceling out of an abundance of caution may be a
bad idea because it could kill a compositor's task by being overly
conservative.

Cheers,
Nicolai
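
To make the two points above concrete, here is a minimal, self-contained sketch in plain C. It is not amdgpu or GPU scheduler code: every type and function name (sketch_fence, sketch_ctx, sketch_job, sketch_job_begin) is hypothetical, and the check_dep_errors switch exists only to contrast the behaviour believed to be current (dependency errors ignored) with a stricter policy that propagates them. The assumption is that the reset path only flags the context as guilty, and that cancellation happens when the scheduler picks the job, which is after its dependencies have signaled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real kernel structures. */
struct sketch_fence {
	int error;      /* 0 on success, negative errno on failure */
	bool signaled;
};

struct sketch_ctx {
	bool guilty;    /* assumed to be the only thing the reset path sets */
};

struct sketch_job {
	struct sketch_ctx *ctx;
	struct sketch_fence *deps[4];
	size_t num_deps;
	struct sketch_fence done;   /* fence signaled when the job finishes */
};

/*
 * Called when the scheduler picks the job, i.e. after all of its
 * dependencies have signaled. Returns true if the job should be
 * submitted to the hardware, false if it was canceled instead.
 */
static bool sketch_job_begin(struct sketch_job *job, bool check_dep_errors)
{
	size_t i;

	/* Jobs of a guilty context are canceled here, not during reset. */
	if (job->ctx->guilty) {
		job->done.error = -ECANCELED;
		job->done.signaled = true;
		return false;
	}

	/* Stricter policy: inherit the first error found on a dependency. */
	if (check_dep_errors) {
		for (i = 0; i < job->num_deps; i++) {
			if (job->deps[i]->error) {
				job->done.error = job->deps[i]->error;
				job->done.signaled = true;
				return false;
			}
		}
	}

	/* Lenient policy: run even if a dependency was canceled. */
	return true;
}
```

With check_dep_errors set to false, a job from an innocent context runs even though it depends on a canceled fence, which is the stale-data risk described above. With it set to true, the same job is canceled as well, which is the overly conservative case that could take down a compositor.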