On 07/18/2012 09:55 AM, Daniel Vetter wrote:
> On Wed, Jul 18, 2012 at 09:23:46AM -0700, Ian Romanick wrote:
>> On 07/18/2012 02:20 AM, Daniel Vetter wrote:
>>> - The "all contexts in a share group need to receive a reset notification"
>>> wording is irking me a bit because we currently only track all the
>>> actively used things. So if another context in that share group isn't
>>> affected (i.e. doesn't even use any of the potentially corrupted bos),
>>> is the idea that the kernel grows the required tracking, or should
>>> userspace always ask for the reset state of all contexts and do the
>>> math itself?
>>
>> There are a couple of reasons that all contexts in a share group need
>> to get the reset notification. Consider an application with two
>> contexts, A and B. Context A is a worker context that does a bunch
>> of render-to-texture operations, and context B will eventually
>> consume those textures. If context A receives a reset, context B,
>> even if it hasn't done any rendering in five minutes, has lost data.
>>
>> The kernel should never have any knowledge about GL share groups.
>> This is where my simplifying assumption (that all contexts in a
>> share group share an address space and an fd) comes in handy. If
>> context A queries the reset status from the kernel first, it can
>> reach over and poke the reset status of context B (in the gl_context
>> structure). Likewise, if context B queries the kernel first, it can
>> see that another kernel context in its GL context share group got
>> reset.
>>
>>> - The above essentially links in with how we assign guilt and figure
>>> out who's affected, especially when there's no hw context around, e.g.
>>> on the blitter or bsd rings, or when an object is used by another
>>> process and gets corrupted there. Does the spec make any guarantees for
>>> such a case? Or should we just blame an unknown_reset for all the
>>> contexts that belong to the fd that submitted the bad batch?
>>
>> That sounds right. If we can't assess innocence or guilt, we should
>> say guilt is unknown. There are some GL commands that generate
>> blits, but I don't think there's anything that can put commands on
>> the BSD ring. That's just for media, right?
>>
>>> As an idea for the above two issues, what about the kernel also keeping
>>> per-fd reset statistics, which would also be returned with this ioctl?
>>> We'd then restrict the meaning of the ctx fields to only mean "and the
>>> context was actually active". Userspace could then wrap all the per-fd
>>> hang reports into reset_unknown for arb_robustness, but I think this
>>> would be a bit more flexible for other userspace.
>>
>> Ah, right. So the DDX or libva could detect resets that affect
>> them. That's reasonable.
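Just to check that I understand the per-fd idea, below is roughly what I'd
picture a non-GL client like the DDX doing. The ioctl number and the exact
structure layout are made up for the sketch (reusing the per-fd fields from
your proposal); the only point is the "compare against the counts we saw
last time" logic.

    /* Sketch only: DRM_IOCTL_GET_RESET_COUNTS and struct drm_reset_counts
     * are placeholders for whatever the final interface looks like; the
     * three fields mirror the per-fd counters proposed below.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    #define DRM_IOCTL_GET_RESET_COUNTS  0   /* placeholder, not a real ioctl */

    struct drm_reset_counts {
        uint32_t fd_guilty;
        uint32_t fd_not_guilty;
        uint32_t fd_unknown_guilt;
    };

    /* Returns true if any batchbuffer submitted through this fd was hit
     * by a reset since the previous call.  'last' caches the counters
     * between calls.
     */
    static bool
    fd_lost_data_since(int fd, struct drm_reset_counts *last)
    {
        struct drm_reset_counts now;
        bool lost;

        if (ioctl(fd, DRM_IOCTL_GET_RESET_COUNTS, &now) != 0)
            return false;   /* old kernel: behave as if nothing happened */

        lost = now.fd_guilty != last->fd_guilty ||
               now.fd_not_guilty != last->fd_not_guilty ||
               now.fd_unknown_guilt != last->fd_unknown_guilt;

        *last = now;
        return lost;
    }

A GL driver would do the same comparison per context with the ctx_* fields,
and fold the per-fd result into reset_unknown as you suggest.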
>>> struct drm_context_reset_counts {
>>>         __u32 ctx_id;
>>>
>>>         /**
>>>          * Index of the most recent reset where this context was
>>>          * guilty. Zero if none.
>>>          */
>>>         __u32 ctx_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where this context was active,
>>>          * not guilty. Zero if none.
>>>          */
>>>         __u32 ctx_not_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where this context was active,
>>>          * but guilt was unknown. Zero if none.
>>>          */
>>>         __u32 ctx_unknown_guilt;
>>>
>>>         /**
>>>          * Index of the most recent reset where any batchbuffer submitted
>>>          * through this fd was guilty. Zero if none.
>>>          */
>>>         __u32 fd_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where any batchbuffer submitted
>>>          * through this fd was not guilty, but affected. Zero if none.
>>>          */
>>>         __u32 fd_not_guilty;
>>>
>>>         /**
>>>          * Index of the most recent reset where any batchbuffer submitted
>>>          * through this fd was affected, but no guilt for the hang could
>>>          * be assigned. Zero if none.
>>>          */
>>>         __u32 fd_unknown_guilt;
>>
>> Since these three fields are per-fd, shouldn't they go in the
>> proposed drm_reset_counts structure instead? If we do that, it
>> might be better to split into two queries: one for the per-fd
>> information, and one for the detailed per-context information. If
>> we expect the common case to be no-reset, user space could
>
> Oh, I've missed a bit that your ioctl would read out the state for all
> contexts on a given fd. My idea was that this is all the data that the
> ioctl returns (and it would only fill out the ctx fields if you specify
> a nonzero context). Userspace would then only need to check whether the
> context has a new reset number somewhere (= the context suffered from a
> gpu reset while it was actively used) and then also check whether the fd
> suffered from a reset (= something in that gl share group is potentially
> corrupted), mapping it to either victim_reset or unknown_reset (depending
> upon what exactly the spec wants).

Consider the case where a process has four contexts: A, B, X, and Y.
A and B are in a share group, and X and Y are in a share group that
does not include A and B. B sees a reset. When the application queries
the reset status on X or Y, it should not see a reset. Knowing that
there was a reset for some other context on the same fd isn't enough
information.
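To make the X/Y case concrete, the share-group bookkeeping I have in mind
lives entirely on the Mesa side. The structures below are trimmed stand-ins
for Mesa's gl_context and gl_shared_state, and the Reset fields are
hypothetical additions for this discussion; the point is only that the
"poke the share group" step touches state that A and B already share, so
X and Y are never affected:

    #include <GL/gl.h>
    #include <GL/glext.h>   /* GL_*_CONTEXT_RESET_ARB enums */

    /* Trimmed stand-ins for Mesa's gl_shared_state / gl_context; the two
     * Reset fields are hypothetical additions for this discussion.
     */
    struct gl_shared_state {
        GLboolean ResetSeen;    /* some context in this share group lost data */
    };

    struct gl_context {
        struct gl_shared_state *Shared;
        GLenum ResetStatus;     /* what happened to this context itself */
    };

    /* Called when a kernel query reports that this context was involved
     * in a reset.  Poking the shared state is what lets context B find
     * out about context A's reset without the kernel knowing anything
     * about share groups.
     */
    static void
    note_reset(struct gl_context *ctx, GLenum kernel_verdict)
    {
        if (ctx->ResetStatus == GL_NO_ERROR)
            ctx->ResetStatus = kernel_verdict;
        ctx->Shared->ResetSeen = GL_TRUE;
    }

    /* Backs glGetGraphicsResetStatusARB() for one context. */
    static GLenum
    get_reset_status(struct gl_context *ctx)
    {
        if (ctx->ResetStatus != GL_NO_ERROR)
            return ctx->ResetStatus;

        /* Not personally reset, but a share-group sibling was: report an
         * innocent (or unknown, depending on how we read the spec) reset.
         */
        if (ctx->Shared->ResetSeen)
            return GL_INNOCENT_CONTEXT_RESET_ARB;

        return GL_NO_ERROR;
    }

Contexts X and Y point at a different gl_shared_state, so their ResetSeen
never gets set by B's reset.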
> Less complexity in userspace, and imo not too much fuss for the kernel to
> do the book-keeping: instead of only blaming contexts, we also blame
> file_privs. That would be a generally useful feature, e.g. to cut off
> certain fds selectively from the gpu because they hang it too fast -
> i.e. to avoid the dreaded global "hanging too fast" when developing
> drivers ;-)
>
>>> };
>>>
>>> The above could also be important to know when a hw context is toast and
>>> should not be used any more.
>>>
>>> - I like Chris' suggestion of using our seqno breadcrumbs for this. If you
>>> also add a small extension to execbuf to return the seqno of the new
>>> batch, you should also have all the tools in place to implement your
>>> extension to notify userspace up to which glFence things have completed.
>>> Note though that returning the seqno in execbuffer is only correct once
>>> we've eliminated the flushing_list.
>>
>> It's possible, but I need to finish working out that idea (see
>> below). I think the context only needs one seqno, not one per
>> guiltiness level: "This is the last seqno that was retired before
>> this context lost some data."
>
> The idea was to use the existing seqno as the reset counter in your ioctl.
> It might wrap around, but you need to emit quite a few commands to get to
> that point ;-) So we could use that right away.
>
>> That may still leave the context in a weird state. Think about this
>> timeline:
>>
>>    Application            GPU
>>    draw A
>>    submit A
>>    draw B
>>    submit B
>>                           drawing A completes
>>                           reset (B is lost)
>>    draw C
>>    submit C
>>                           drawing C completes
>>    query reset
>>
>> If / when we implement this feature, the kernel may need to drop any
>> work submitted between a reset and an ack of the reset. Dunno.
>
> Well, we could simply add a robust_context flag, set at create time. If
> such a context suffers from a gpu reset (while it's active) the kernel
> could refuse to queue up more batches with -EIO. Userspace could then

If we go that way, I think it would be better to just drop the
requests. OpenGL doesn't let me start generating errors after a reset,
so I'd have to selectively ignore these errors from the kernel. It
sounds like it would add a bunch of annoying complexity to user space.
Of course, we're talking about step 38 here, and we're still on step 2.

> inquire the kernel with your ioctl about what's going on. Similarly we
> could add such a flag to the fd, to catch everyone else not using hw
> contexts. There we need a way to clear that flag though, so that you can
> keep on using all the bo textures that survived (whereas hw contexts are
> pretty much worthless on their own after a reset, so we could demand that
> they be destroyed).
>
>>> - Last but not least, how often is userspace expected to call this ioctl?
>>> Around once per batchbuffer submission or way more/less?
>>
>> I don't know. This is a fairly new extension, and there are few
>> users. As far as I'm aware, only Chrome and Firefox use it. I can
>> find out some details from them. My guess is somewhere between once
>> per frame and once per draw call.
>
> Cheers, Daniel
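For completeness, here is roughly how I picture the per-context counters
from the proposed struct being turned into the verdict that gets handed to
the share-group bookkeeping sketched earlier. The cached-baseline structure
and the function names are made up; the only point is the "did any counter
move since we last looked" comparison:

    #include <GL/gl.h>
    #include <GL/glext.h>
    #include <stdint.h>

    /* Cached copy of the ctx_* counters from the proposed
     * drm_context_reset_counts, taken the last time this context asked.
     */
    struct ctx_reset_baseline {
        uint32_t guilty;
        uint32_t not_guilty;
        uint32_t unknown_guilt;
    };

    /* Compare fresh counters against the cached baseline.  GL_NO_ERROR
     * means no reset touched this context since the previous query;
     * anything else gets fed into the share-group bookkeeping above.
     */
    static GLenum
    classify_reset(struct ctx_reset_baseline *base,
                   const struct ctx_reset_baseline *now)
    {
        GLenum verdict = GL_NO_ERROR;

        if (now->guilty != base->guilty)
            verdict = GL_GUILTY_CONTEXT_RESET_ARB;
        else if (now->not_guilty != base->not_guilty)
            verdict = GL_INNOCENT_CONTEXT_RESET_ARB;
        else if (now->unknown_guilt != base->unknown_guilt)
            verdict = GL_UNKNOWN_CONTEXT_RESET_ARB;

        *base = *now;   /* remember what we have already reported */
        return verdict;
    }

The per-fd counters would be folded in the same way, except that, as
discussed above, their result would only ever map to an unknown reset for
arb_robustness.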