On Mon, Oct 25, 2010 at 01:15:30PM +0200, Ingo Molnar wrote:
> > > > einj.c: it's about the 3rd separate 'error injection' concept that got
> > > > introduced ...
> > >
> > > EINJ is a true platform feature, not just software feature. We need to support
> > > it to debug various hardware error features.
> >
> > Also having multiple error injecting interfaces is a good thing.
>
> It's never a good thing to have separate, vendor dependent interfaces for what to
> the user is basically the same conceptual thing!

Perhaps a simple example (simplified, in practice there are more
complications) makes it clearer:

The memory error handler takes different actions depending on the state
of the page the error hits. To get reasonable coverage of the recovery
code you need to present it with pages in different states (locked,
clean, dirty, IO, etc.)

Now it turns out this is very hard to do if you just inject the error
at the hardware level, because there are lots of races and problems
ensuring the page is still in the expected state.

So one of the solutions hwpoison adopted was another injector that
works at the process level. At the process level you can get pages into
different states and reasonably cleanly inject the right error with the
right context. This is essentially error injection at the VM level.

Again this is simplified. For coverage we actually needed multiple
injectors that work at different entry points, for example to make sure
buffered file system pages are correctly handled too.

Now that's great, but we still need other injectors that work at other
levels, otherwise the parts that talk to the hardware are not covered
at all.

But you cannot test all the code paths of that code using a single
hardware injector either. So there's another one that can fake
different contexts at the software level and provides reasonable
coverage of this code.
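
As a rough sketch of the process-level injection described above: the
program first drives a page into a chosen state and only then asks the
kernel to treat it as hit by a memory error. The mail does not name the
interface; this assumes the MADV_HWPOISON advice to madvise(2), which
needs CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE. The
names make_page, inject, STATES and the state labels are made up for
illustration, and injection defaults to a dry run because a real one is
handled like an actual memory error on that page.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
# Hypothetical state labels; the real handler distinguishes many more.
STATES = ("anon-clean", "anon-dirty", "file-clean", "file-dirty")


def make_page(state):
    """Map one page and drive it into the requested state (best effort)."""
    if state.startswith("file"):
        # Page-cache page backed by a temporary file.
        backing = tempfile.TemporaryFile()
        backing.write(b"\0" * PAGE)
        backing.flush()
        os.fsync(backing.fileno())  # write back so the cached page starts clean
        page = mmap.mmap(backing.fileno(), PAGE)  # shared file mapping
        backing.close()  # the mapping keeps its own duplicate descriptor
    else:
        page = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE)  # anonymous memory
    if state.endswith("dirty"):
        page[0] = 0xA5  # a write makes the page dirty
    else:
        _ = page[0]     # a read only faults the page in
    return page


def inject(page, dry_run=True):
    """Ask the kernel to treat the page as poisoned; True if attempted."""
    advice = getattr(mmap, "MADV_HWPOISON", None)  # Linux-only constant
    if dry_run or advice is None:
        return False
    # Raises OSError without CAP_SYS_ADMIN or without kernel support.
    page.madvise(advice, 0, PAGE)
    return True
```

With sufficient privileges, inject(make_page("file-dirty"), dry_run=False)
would exercise the dirty page-cache path, and looping over STATES gives
one test case per page state. The state is set up from the same process
that requests the injection, so it cannot race away the way it can with
a hardware-level injector.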

But then you still haven't tested the whole hardware to software error
path.

Now yes, you could use an EDAC-like ECC bit injector (which BTW doesn't
really need a kernel driver, we did it just fine using shell scripts
before). But that also only tests one path and not the other
possibilities, and it only works on specific hardware in specific modes
with very careful setup.

But that's just one type of error for one system. So you need other
interfaces for other hardware and for other errors.

In some cases you also need to talk to the BIOS to do this injection
for various reasons. That is where ACPI comes in (and all these
acronyms you seem to object to).

Also it's not enough to do a single error injection once on some
system. You need a repeatable regression test that ensures the error
handling stays operable as kernels evolve. This requires that the error
injection is reasonably portable. For this I tried to have a "software
only" injector for nearly everything, just to make sure the code can
always be tested.

Unfortunately the software injectors, especially the ones aiming at
larger coverage, also have quite different interface requirements than
the hardware interfaces.

But again you still need to test the full hardware too, otherwise we
don't know if the error handling is really working on a real system in
practice.

Error injection is just messy. There is no single general solution that
works for everything and solves all problems; you really need a
pragmatic approach for every subsystem.

In the end it means you end up with lots of different injectors, all
tied to some specific problem.

Would it be nice if there was a single great injector that covers
everything? Yes.

Is it realistic? No.

-Andi

-- 
ak@xxxxxxxxxxxxxxx -- Speaking for myself only.