On Wed, Oct 05, 2011 at 12:37:28PM +0530, K.Prasad wrote:
> > Well, there are MCE types for which we need to panic but we don't
> > necessarily corrupt memory. Your approach is to unconditionally avoid
> > dumping core whenever we panic while you should look at the MCE
> > signature and decide then whether to capture crashed kernel memory or
> > not.
> >
> > For example, if the MCE signature says UC DRAM error, then you can
> > be pretty sure that there is a landmine somewhere in the DRAM region
> > mapping the crashed kernel. If it is, say, a UC when doing data fills
> > from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
> > even in the first case, you can evaluate the MCi_ADDR reported with the
> > UC DRAM error and simply skip that particular cacheline when dumping the
> > core instead of not capturing anything at all.
> >
>
> True. Like stated by me earlier, there could be two possible outcomes
> from capturing memory dump in such cases - they're either dangerous or
> doesn't make sense.

Why, in the second example the only corruption is to the L2 cache so
your memory image is intact. Why wouldn't you want to capture a memory
dump then? It is business as usual in that case.

> It is best to avoid a normal kdump in both cases,
> although the elf-note doesn't distinguish between the two.
>
> NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
> a framework where different code paths that lead to panic() can
> 'opt-out' from kdump by adding an elf-note.
>
> We can modify this to add more fine-grained messages using different elf-note
> types (or use the elf-note name under the NT_NOCOREDUMP type) to
> indicate the cause/type of crash.
>
> I'd like to hear further from you and the rest of the community to see if
> there's a need felt for such a change.

I'd make this conditional on whether you have had memory corruption or
not by evaluating MCE signatures and acting accordingly.

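To make "evaluating MCE signatures and acting accordingly" a bit more
concrete, here is a rough, untested-on-hardware sketch of the kind of
decision meant above. Only the MCi_STATUS bit positions are
architectural; struct mce_sig, the dram_error flag, dump_policy_for()
and the 64-byte line size are hypothetical names and assumptions made
up for illustration, not existing kernel or kdump code. Classifying a
signature as a DRAM error is vendor-specific and is deliberately left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural MCi_STATUS bits. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR holds a valid address */

/* Assumption: 64-byte cachelines. */
#define CACHELINE_SIZE 64ULL

/* Hypothetical outcomes for the capture decision. */
enum dump_policy {
	DUMP_ALL,       /* memory image intact, business as usual */
	DUMP_SKIP_LINE, /* dump everything except one poisoned cacheline */
	DUMP_NONE       /* DRAM is bad and we do not know where */
};

/* Hypothetical container for the logged MCE signature. */
struct mce_sig {
	uint64_t status;  /* MCi_STATUS */
	uint64_t addr;    /* MCi_ADDR */
	bool dram_error;  /* result of vendor-specific decoding, not shown */
};

static enum dump_policy dump_policy_for(const struct mce_sig *m,
					uint64_t *skip_line)
{
	assert(m && skip_line);

	/* Nothing valid logged, or the error was corrected: dump as usual. */
	if (!(m->status & MCI_STATUS_VAL) || !(m->status & MCI_STATUS_UC))
		return DUMP_ALL;

	/* UC outside DRAM (e.g. an L2 to L1 data fill): image is intact. */
	if (!m->dram_error)
		return DUMP_ALL;

	/* UC DRAM error with a known address: skip just that cacheline. */
	if (m->status & MCI_STATUS_ADDRV) {
		*skip_line = m->addr & ~(CACHELINE_SIZE - 1);
		return DUMP_SKIP_LINE;
	}

	/* UC DRAM error without an address: conservative choice (assumption). */
	return DUMP_NONE;
}
```

The point is only that the "no dump" case shrinks to UC DRAM errors
without a usable address; everything else either dumps normally or
skips a single line.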
> > Btw, the doublefault example you give above - is this something you
> > experience on real hardware or just a theoretical thing?
> >
>
> Unfortunately, I still haven't been able to try injecting memory errors
> and study the behaviour (trying to get access to machine with
> appropriate firmware). I'll have a reply to this after some experiments
> with memory error injection.

Right, this might be much more helpful than theoretical discussions on
what to do. :-)

Thanks.

-- 
Regards/Gruss,
Boris.