On Wed, Jun 12, 2019 at 06:29:26PM +1000, Benjamin Herrenschmidt wrote: > I tend to disagree here. We've been down that rabbit hole in the past > and we (Linux in general) are trying to move away from that sort of > "platform" overarching driver as much as possible. Why is a "platform" driver like that so bad? > This is a policy. It should either belong to userspace, For some errors you can't do userspace as it is too late for it - you wanna address that before you return to it. > or be in some generic RAS code in the kernel, there's no reason why > these can't be abstracted. Yes, we have this drivers/ras/cec.c thing which collects correctable DRAM errors on x86. :-) > Also in your specific example, it could be entirely local to the MC > EDAC / DRAM controller path, we could have a generic way for EDAC to > advertise that a given memory channel is giving lots of errors and > have memory controller drivers listen to it but usually the EDAC MC > driver *is* the only thing that looks like a MC driver to begin with, > > so again, pretty much no overlap with L1/L2 caches RAS or PCIe RAS > etc... > > Unless I'm mistaken, that amd64 EDAC is just an MC one... but I only > had a cursory glance at the code. EDAC has historically been concentrating on DRAM errors as that is what people have been paying attention to. But it isn't limited to DRAM errors - there is some basic PCI errors functionality behind edac_pci_create_generic_ctl() which polls for PCI parity errors. So it still makes sense to me to have a single driver which takes care of all things RAS for a platform. You just load one driver and it does it all, including recovery actions. > Maybe because what you are promoting might not be the right path > here... seriously, there's a reason why all vendors want to go down > that path and in this case I don't think they are wrong. > > This isn't about just another ARM vendor, in fact I'm rather new to the > whole ARM thing, I used to maintain arch/powerpc :-) What happened? :-P > The point is what you are trying to push for goes against everything > we've been trying to do in Linux when it comes to splitting drivers to > individual IP blocks. Please do explain why is a driver-per-IP-block better and please don't give me the DT argument - I've heard that one more than enough now. Like, for example, how do you deal with the case where you have a platform which has a bunch of IP blocks with RAS functionality and they all have a separate driver. Now, you want to do recovery and failover actions depending on certain error count from a certain group of IP blocks. You export those counts through sysfs from the separate drivers and you have a userspace agent doing that policy? That cannot always fly because recovery actions for some errors need to happen before we return to userspace - i.e., memory-failure.c types of scenarios. You add another "counting" layer which is yet another driver which collects those errors and applies policy actions? But then that layer needs to be made generic enough to be shared by the other EDAC IP block drivers, otherwise every platform would need its own counter layer driver. Which basically puts the problem somewhere else but doesn't make it go away. Another way I'm not thinking of at the moment? A single driver solves that problem as it has all the required information in one place and deals with it then and there. I hear you that platform drivers are frowned upon but connecting it all in one place for such purposes makes sense to me in this particular case. Btw, what is your final goal with these drivers? Dump decoded error information in dmesg? Or something more sophisticated? Thx. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.