On Tue, Jun 04, 2024 at 04:56:57PM -0700, Dan Williams wrote:
> Jakub Kicinski wrote:
> [..]
> > I don't begrudge anyone building proprietary options, but leave
> > upstream out of it.
>
> So I am of 2 minds here. In general, how is upstream benefited by
> requiring every vendor command to be wrapped by a Linux command?

People actually can use upstream :)

Amazingly there is inherent benefit to people being able to use the
software we produce.

> 3 years on from that recommendation it seems no vendor has even needed
> that level of distribution help. I.e. checking a few distro kernels
> (Fedora, openSUSE) shows no uptake for CONFIG_CXL_MEM_RAW_COMMANDS=y in
> their debug builds. I can only assume that locally compiled custom
> kernel binaries are filling the need.

My strong advice would be to be careful about this. Android-ism where
nobody runs the upstream kernel is a real thing. For something emerging
like CXL there is a real risk that the hyperscale folks will go off and
do their own OOT stuff and in-tree CXL will be something usable but
inferior.

I've seen this happen enough times.. If people come and say we need X
and the maintainer says no, they don't just give up and stop doing X,
they go and do X anyhow out of tree. This has become especially true
now that the center of business activity in server-Linux is driven by
the hyperscale crowd that don't care much about upstream.

Linux maintainers don't actually have the power to force the industry
to do things, though people do keep trying.. Maintainers can only lead,
and productive leading is not done with a NO.

You will start to see this pain in maybe 5-10 years if CXL starts to be
something deployed in an enterprise RedHat/Dell/etc sort of environment.
Then that missing X becomes a critical issue because it turns out the
hyperscale folks long since figured out it is really important but
didn't do anything to enable it upstream.
There is merit in upstream being something people can and do actually
use, not just an ivory tower of architectural perfection. There is
merit in bringing code into the community instead of forcing things to
be OOT.

For instance the thread you linked where there was talk of needing the
signal integrity data is a great example. Sure some of that is
manufacturing time, but also if you deploy a million interfaces in a
datacenter, then yes, there will be a need to collect SI information
from live systems and do some analysis on it.

You wouldn't believe how much physically broken HW leaks out into data
centers and needs manufacturing level debugging techniques to properly
root cause :(

> userspace-to-device-firmware tunnel?" to at least get all the various
> concerns documented in one place, and provide guidance for how device
> vendors should navigate this space across subsystems.

This is my effort here.

If we document the expectations there is a much better chance that a
standards body or device manufacturer can implement their interfaces in
a way that works with the OS. There is a much higher chance they will
attract CVEs and be forced to fix it if the security expectations are
clearly laid out. You had a good observation in one of those links
about how they are not OS people. Let's help them do better.

Shunt the less robust stuff to fwctl and then people can also make
their own security choices: don't enable or load the fwctl modules and
you get more protection. It is closer to your
CONFIG_CXL_MEM_RAW_COMMANDS=y but at runtime.

I think I captured most of your commentary below here in patch 6.

> Effects Log". In that "trust Command Effects" scenario the kernel still
> has no idea what the command is actually doing, but it can at least
> assert that the device does not claim that the command changes the
> contents of system-memory. Now, you might say, "the device can just
> lie", but that betrays a conceit of the kernel restriction.
> A device could lie that a Linux wrapped command when passed certain
> payloads does not in turn proxy to a restricted command.

Yeah, we have to trust the device. If the device is hostile toward the
OS then there are already big problems. We need to allow for
unintentional defects in the devices, but we don't need to be paranoid.

IMHO a command effects report, in conjunction with a robust OS-centric
definition, is something we can trust in.

> * Introspection / validation: Subsystem community needs to be able to
>   audit behavior after the fact.
>
>   To me this means even if the kernel is letting a command through based
>   on the stated Command Effect of "Configuration Change after Cold Reset"
>   upstream community has a need to be able to read the vendor
>   specification for that command. I.e. commands might be vendor-specific,
>   but never vendor-private. I see this as similar to the requirement for
>   open source userspace for sophisticated accelerators.

I'm less hard on this. As long as reasonable open userspace exists I
think it is fine to let other stuff through too. I can appreciate the
DRM stance on this, but IMHO, there is meaningfully more value for open
source in trying to get an open Vulkan implementation vs blocking users
from reading their vendor'd diagnostic SI values.

I don't think we should get into some kind of extremism and insist that
every single bit must be documented/standardized or Linux won't support
it.

This is why I envision fwctl as not being suitable for actual
datapath/performance stuff.

> * Collaboration: open standards support open driver maintenance.
>
>   Without standards we end up with awkward situations like Confidential
>   Computing where every vendor races to implement the same functionality
>   in arbitrarily different and vendor specific ways.

Standards are important. Linux is not a standards body. Linux
maintainers can only advise, not force, the industry to make standards.
At a certain point Linux's job is to implement software to support what
people have built. CC is a sad example where the industry did not get
together enough, but still Linux will support the CC mess.

> For CXL devices, and I believe the devices fwctl is targeting, there
> are a whole class of commands for vendor specific configuration and
> debug. Commands that the kernel really need not worry about.

Right.

> Some subsystems may want to allow high-performance science experiments
> like what NVMe allows, but it seems worth asking the question if
> standardizing device configuration and debug is really the best use of
> upstream's limited time?