On 05/22/2018 09:54 AM, Borislav Petkov wrote: > On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote: >> No, the problem is with the current approach, not with mine. The problem >> is trying to handle the error outside of the existing handler. That's a >> no-no, IMO. > > Let me save you some time: until you come up with a proper solution for > *all* PCIe errors so that the kernel can correctly decide what to do for > each error based on its actual severity, consider this NAKed. I do have a proper solution for _all_ PCIe errors. In fact, we discussed several valid approaches already. > I don't care about outside or inside of the handler I do. I have a handler that can handle (no pun intended) errors. I want to use the same code path in native and GHES cases. If I allow ghes.c to take different decisions than what aer_do_recovery() would, I've failed. >- this thing needs to be done properly Exactly! > and not just to serve your particular use case of > abrupt removal of devices causing PCIe errors, and punish the rest. I think you're confused about what I'm actually trying to do. Or maybe you're confused about how PCIe errors work. That's understandable. PCIe uses the term "fatal" for errors that may make the link unusable, and which may require a link reset, and in most other specs "fatal" means "on fire". I understand your confusion, and I hope I cleared it up. You're trying to make the case that surprise removal is my only concern and use case, because that's the example that I gave. It makes your argument stronger, but it's wrong. You don't know our test setup, and all the things I'm testing for, and whenever I try to tell you, you fall back to the 'surprise removal' example. I don't know why you'd think Dell would pay me to work on this if I were to allow things like silent data corruption to creep in. This isn't a traditional company from Redmond, Washington. > I especially don't want to have the case where a PCIe error is *really* > fatal and then we noodle in some handlers debating about the severity > because it got marked as recoverable intermittently and end up causing > data corruption on the storage device. Here's a real no-no for ya. I especially don't want a kernel maintainer who hasn't even read the recovery handler (let alone the spec around which the handler was written) tell me how the recovery handler works and what it's supposed to do (see, I can be an ass). PCIe errors really are fatal. They might need to unload the driver and remove the device. But somebody set the questionable policy that "fatal"=="panic", and that is completely inappropriate for a larger class of errors -- PCIe happens to be the easiest example to pick on. And even you realize that the argument that a panic() will somehow prevent data corruption is complete noodle sauce. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html