On 02/27/2013 01:13 AM, Chris Wilson wrote: > On Tue, Feb 26, 2013 at 05:47:12PM -0800, Ian Romanick wrote: >> On 02/26/2013 03:05 AM, Mika Kuoppala wrote: >>> For arb-robustness, every context needs to have it's own >>> reset state tracking. Default context will be handled in a identical >>> way as the no-context case in further down in the patch set. >>> For no-context case, the reset state will be stored in >>> the file_priv part. >>> >>> v2: handle default context inside get_reset_state >> >> This isn't the interface we want. I already sent you the patches >> for Mesa, and you seem to have completely ignored them. Moreover, >> this interface provides no mechanism to query for its existence >> (other than relying on the kernel version), and no method to >> deprecate it. > > It's existence, and the existence of any successor, is the only test you > need to check for the interface. Testing the flag field up front also > lets you know if individual flags are known prior to use. > >> Based on e-mail discussions, I think danvet agrees with me here. >> >> Putting guilty / innocent counting in kernel puts policy decisions >> in the kernel that belong with the user space API that implements >> them. Putting these choices in the kernel significantly decreases >> how "future proof" the interface is. Since any kernel/user >> interface has to be kept forever, this is just asking for >> maintenance trouble down the road. > > I think you have the policy standpoint reversed. Can you elaborate? I've been out of the kernel loop for a long time, but I always thought policy and mechanism were supposed to be separated. In the case of drivers with a user-mode component, the mechanism was in the kernel (where else could it be?), and policy decisions were in user-mode. This is often at odds with keeping the interfaces between kernel and user thin, so that may be where my misunderstanding is. >> Also, it's difficult (for me) to tell from the rest of the series >> whether or not a context that was not affected by a reset (had no >> batches in flight at all) can still observe that a reset occurred. > > Yes, the context can observe that the global reset increased, its did > not. See below for a couple reasons why I think this may be a mistake. >> From the GL point of view, once a context has observed a reset, it's >> garbage. Knowing that it saw 1 reset or 43,000,000 resets is the >> same. Since it's a count, GL will have to query the values before >> any rendering happens or it won't know whether the value 6 means >> there was a reset or not. >> >> The right interface is a set of flags indicating the state the >> context was in when it observed a reset. As this patch series >> stands, Mesa will not use this interface. > > This interface conforms to your specifications and interface that you > last posted to the list and were revised on list. There are two > substantive differences; there is *less* policy in the kernel than > before, and the breadcrumb of last reset state was overlooked. > > In order to make the reset status work in a multi-threaded environment > the kernel exists in, we have to assume that there will be multiple > actors on the reset status, and they will be comparing the state that > they are interested in. The set of flags that GL understands is > deducible from this interface. We had some discussion on the list about a proposed interface last year. We had a really good discussion, and both you and Daniel provided some really valuable feedback. The biggest piece of feedback that both of you provided was to simplify. I took that advice to heart. The original interface was clunky and over complicated. If memory serves, that discussion was on the order of 8 to 10 months ago. Quite a lot has happened since then. The biggest thing that happened is I started implementing the interface we discussed in the kernel, libdrm, and Mesa. In that process I found a number of problems with even the simplified interface. Back in September I used this new data to simplify the interface even further. This allowed both the kernel and the user-mode code to be much less complex. Right about that same time the whole Mesa team got super busy. First we had OpenGL 3.1, then we had OpenGL ES 3.0. During that time all of my ARB_robustness work languished. I didn't want to send another round of not-fully-baked ideas (or patches) to the list, so they just sat. And sat. And sat. :( At FOSDEM earlier this month, I discussed this with Jesse. I told him that there was approximately epsilon probability that I would get back to this work. As a result, I was going to have to throw it over the wall to the kernel team. At no point did he mention that anyone was already working on this. About a week ago Daniel pulled me into a discussion with Mika about the ARB_robustness work. I dug out my old code, paged back in the memory of the work from off-line storage, and provided a bunch of feedback to Mika and Daniel. After I sent around my explanation of why I abandoned the counting interface and circulated my work, there was no response from Mika or anyone on that team. I had a brief discussion with Daniel on IRC, and he seemed to agree with my sentiments and like the direction of my code. Then this patch series appeared on the list, and here we are. There are two requirements in the ARB_robustness extension (and additional layered extensions) that I believe cause problems with all of the proposed counting interfaces. 1. GL contexts that are not affected by a reset (i.e., didn't lose any objects, didn't lose any rendering, etc.) should not observe it. This is the big one that all the browser vendors want. If you have 5 tabs open, a reset caused by a malicious WebGL app in one tab shouldn't, if possible, force them to rebuild all of their state for all the other tabs. 2. After a GL context observes a reset, it is garbage. It must be destroyed and a new one created. This means we only need to know that a reset of any variety affected the context. In addition, I'm concerned about having one GL context be able to observe, in any manner, resets that did not affect it. Given that people have figured out how to get encryption keys by observing cache timings, it's not impossible for a malicious program to use information about reset counts to do something. Leaving a potential gap like this in an extension that's used to improved security seems... like we're just asking for a Phoronix article. We don't have any requirement to expose this information, so I don't think we should. We also don't have any requirement to expose this functionality on anything pre-Sandy Bridge. We may want, at some point, to expose it on Ironlake. Based on that, it seems reasonable to tie the availability of reset notifications to hardware contexts. Since we won't test it, Mesa certainly isn't going to expose it on 8xx or 915. Based on this, a simplified interface became obvious: for each hardware context, track a mask of kinds of resets that have affected that context. I defined three bits in the mask: 1. The hw context had a batch executing when the reset occurred. 2. The hw context had a batch pending when the reset occurred. Hopefully in the future we could make most occurrences of this go away by re-queuing the batch. It may also be possible to do this in user-mode, but that may be more difficult. 3. "Other." There's also the potential to add other bits to communicate information about lost buffers, textures, etc. This could be used to create a layered extension that lets applications transfer data from the dead context to the new context. Browsers may not be too interested in this, but I could imagine compositors or other kinds of applications benefiting. We may never implement that, but it's easier to communicate this through a set of flags than through a set of counts. The guilty / innocent counts in this patch series lose information. We can probably reconstruct the current flags from the counts, but we would have to do some refactoring (or additional counting) to get other flags in the future. All of this would make the kernel code more complex. Why reconstruct the data you want when you could just track that data in the first place? Since the functionality in not available on all hardware, I also added a query to determine the availability. This has the added benefit that, should this interface prove to be insufficient in the future, we could deprecate it by having the query always return false. All software involved has to be able the handle the case where the interface is not available, so I don't think this should run awry of the rules about kernel/user interface breakage. Please correct me if I'm wrong. The Mesa patches are in my robustness2 branch: http://cgit.freedesktop.org/~idr/mesa/log/?h=robustness2 As Mika pointed out, the patch that makes use of the new kernel interface is: http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=ef97f4092c703afd49b3516b15dfbfcd27d4bc97 In addition, http://cgit.freedesktop.org/~idr/mesa/commit/?h=robustness2&id=6f73cbeecb90b0a08b0f016e19a16f6f50a1a99c changes to brw_context.c to use intel_get_boolean to determine the availability of the kernel query. I'd really like to only expose __DRI2_ROBUSTNESS when the kernel query exists, but I don't see a way to do that with the DRI extension mechanism. The libdrm patches are in master of: http://cgit.freedesktop.org/~idr/drm I had attacked the problem from the opposite end from Mika. My initial implementation always reported "other" because a lot of other work needed to happen to differentiate the other cases. It's good to see that work progressing. My really, really, really incomplete kernel patch is at: http://people.freedesktop.org/~idr/robust.patch This patch was against some kernel from early September 2012. It may not apply or build at this point. I believe I was able to observe a reset by having an app run a shader with an infinite loop and poll glGetGraphicsResetStatusARB. I couldn't find that code, so I may be misremembering. With the exception of the I915_PARAM_HAS_RESET_QUERY handling, nearly all of this patch is probably useless at this point.