On Thu, Jun 13, 2019 at 09:54:18AM +1000, Benjamin Herrenschmidt wrote:
> It tends to be a slippery slope. Also in the ARM world, most SoC tend
> to re-use IP blocks, so you get a lot of code duplication, bug fixed in
> one and not the other etc...

Yes, I'd like to be able to reuse EDAC drivers if they're for single IP
blocks and those IP blocks get integrated by multiple vendors.

> I don't necessarily mind having a "platform" component that handles
> policies in case where userspace is really not an option, but it
> shouldn't be doing it by containing the actual drivers for the
> individual IP block error collection. It could however "use" them via
> in-kernel APIs.

Ok, sounds good.

> Those are rare. At the end of the day, if you have a UE on memory, it's
> a matter of luck. It could have hit your kernel as well. You get lucky
> it only hit userspace but you can't make a general statement you "can't
> trust userspace".

I'm not saying that - I'm saying that if we're going to do a
comprehensive solution, we had better address all possible error
severities with adequate handling.

> Cache errors tend to be the kind that tend to have to be addressed
> immediately, but even then, that's often local to some architecture
> machine check handling, not even in EDAC.

That's true.

> Do you have a concrete example of a type of error that
>
>  - Must be addressed in the kernel
>
>  - Relies on coordinating drivers for more than one IP block
>
> ?

My usual example is at the end of the #MC handler on x86,
do_machine_check():

        /* Fault was in user mode and we need to take some action */
        if ((m.cs & 3) == 3) {
                ist_begin_non_atomic(regs);
                local_irq_enable();

                if (kill_it || do_memory_failure(&m))
                        force_sig(SIGBUS, current);

we try to poison the page and, if we fail or have to kill the process
anyway, off it goes. Yes, this is not talking to EDAC drivers yet, but
it is exemplary of a more involved recovery action.

> Even then though, my argument would be that the right way to do that,
> assuming that's even platform specific, would be to have then the
> "platform RAS driver" just layer on top of the individual EDAC drivers
> and consume their output. Not contain the drivers themselves.

Ok, that's a fair point and I like that design.

> Using machine checks, not EDAC. It's completely orthogonal at this
> point at least.

No, it is using errors reported through the Machine Check Architecture.
EDAC uses the same DRAM error reports - they all come from MCA on x86.
It is a whole notifier chain which gets to see those errors, but they
all come from MCA. PCI errors get reported differently, of course.

EDAC is just a semi-dumb layer around some of those error reporting
mechanisms.

> That said, it would make sense to have an EDAC API to match that
> address back into a DIMM location and give user an informational
> message about failures happening on that DIMM. But that could be done
> via core EDAC MC APIs.

That's what the EDAC drivers on x86 do. All of them. :-)

> Here too, no need for having an over-arching platform driver.

Yes, the EDAC drivers which implement all the memory controller
functionality already do that mapping back. Or at least try to. There's
firmware doing that on x86 too, but that's a different story.

<will reply to the rest later in another mail as this one is becoming
too big anyway>.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
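
P.S.: For completeness, here is roughly what a "consumer" hanging off
that MCA notifier chain could look like - a minimal, untested sketch
assuming the usual mce_register_decode_chain() interface; the
"ras_policy" naming and the policy hook are made up for illustration,
not an existing driver:

        #include <linux/module.h>
        #include <linux/notifier.h>
        #include <asm/mce.h>

        /*
         * Hypothetical platform policy layer: it only looks at the logged
         * MCEs which the chain hands it, it does not contain or replace
         * the per-IP-block EDAC drivers doing the actual decoding.
         */
        static int ras_policy_notify(struct notifier_block *nb,
                                     unsigned long val, void *data)
        {
                struct mce *m = data;

                if (!m)
                        return NOTIFY_DONE;

                /* feed (bank, status, addr) into whatever platform policy */
                pr_info("ras_policy: bank %d status 0x%llx addr 0x%llx\n",
                        m->bank, m->status, m->addr);

                return NOTIFY_OK;
        }

        static struct notifier_block ras_policy_nb = {
                .notifier_call  = ras_policy_notify,
                .priority       = MCE_PRIO_LOWEST,  /* run after the EDAC decoders */
        };

        static int __init ras_policy_init(void)
        {
                mce_register_decode_chain(&ras_policy_nb);
                return 0;
        }

        static void __exit ras_policy_exit(void)
        {
                mce_unregister_decode_chain(&ras_policy_nb);
        }

        module_init(ras_policy_init);
        module_exit(ras_policy_exit);
        MODULE_LICENSE("GPL");

That is the "layer on top and consume their output" shape: the policy
piece registers late in the chain and reacts to what the decoders have
already seen.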