On Wed, Mar 24, 2021 at 11:55 PM Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> wrote: > > On Wed, Mar 24, 2021 at 06:23:54PM -0700, Dan Williams wrote: > > The PCIE Data Object Exchange (DOE) mailbox is a protocol run over > > configuration cycles. It assumes one initiator at a time is > > reading/writing the data registers. > > That sounds like a horrible protocol for a multi-processor system. > Where is it described and who can we go complain to for creating such a > mess? It appears it was added to the PCIE specification in March of last year, I don't attend those meetings. I only learned about it since the CXL specification adopted it. > > > If userspace reads from the response > > data payload it may steal data that a kernel driver was expecting to > > read. If userspace writes to the request payload it may corrupt the > > request a driver was trying to send. > > Fun! So you want to keep root in userspace from doing this? I thought > we already do that today? The only limitation I found was temporary locking via pci_cfg_access_lock(), and some limitations on max config offset, not permanent access disable. > > > Introduce pci_{request,release}_config_region() for a driver to exclude > > the possibility of userspace induced corruption while accessing the DOE > > mailbox. Likely there are other configuration state assumptions that a > > driver may want to assert are under its exclusive control, so this > > capability is not limited to any specific configuration range. > > As you do not have a user for these functions, it's hard to see how they > would be used. We also really can't add new apis with no in-tree users, > so do you have a patch series that requires this functionality > somewhere? Whoops, I buried the lead here. This is in reaction to / support of Jonathan's efforts to use this mailbox to retrieve a blob of memory characteristics data from CXL devices called the CDAT [1]. That blob is used to populate / extend ACPI SRAT/HMAT/SLIT data for CXL attached memory. So while I was reviewing his patches it occurred to me that the b0rked nature of this interface relative to pci-sysfs needed to be addressed. This should wait to go in with his series. [1]: https://lore.kernel.org/linux-acpi/20210310180306.1588376-2-Jonathan.Cameron@xxxxxxxxxx/ > > > Since writes are targeted and are already prepared for failure the > > entire request is failed. The same can not be done for reads as the > > device completely disappears from lspci output if any configuration > > register in the request is exclusive. Instead skip the actual > > configuration cycle on a per-access basis and return all f's as if the > > read had failed. > > returning all ff is a huge hint to many drivers that the device is gone, > not that it just failed. So what happens to code that thinks that and > then tears stuff down as if the device has been removed? This is limited to the pci_user_* family of accessors, kernel drivers should be unaffected. The protection for kernel drivers colliding is the same as request_mem_region() coordination. Userspace drivers will of course be horribly confused, but those should not be binding to devices that are claimed by a kernel driver in the first place. > Trying to protect drivers from userspace here feels odd, what userspace > tools are trying to access these devices while they are under > "exclusive" control from the kernel? lspci not running as root should > not be doing anything crazy, but if you want to run it as root, > shouldn't you be allowed to access it properly? The main concern is unwanted userspace reads. An inopportune "hexdump /sys/bus/pci/devices/$device/config" will end up reading the DOE payload register and advancing the device state machine surprising the kernel iterator that might be reading the payload. If root really wants to read that portion of config space it can also unload the driver similar to the policy for /dev/mem colliding with exclusive device-mmio... although that raises the question how would root know. At least for exclusive /dev/mem /proc/iomem can show who is claiming that resource. If userspace needs to submit DOE requests then that should probably be a proper generic driver to submit requests, not raw pci config access. > What hardware has this problem that we need to claim exclusive ownership > over that differs from the old hardware we used to have that would do > crazy things when reading from from userspace? We had this problem a > long time ago and lived with it, what changed now? All I can infer from the comments in drivers/pci/access.c is "bad things happens on some devices if you allow reads past a certain offset", but no concern for reads for offsets less than pdev->cfg_size. I think what's changed is that this is the first time Linux has had to worry about an awkward polled I/O data transfer protocol over config cycles. To make matters worse there appears to be a proliferation of protocols being layered on top of DOE. In addition to CXL Table Access for CDAT retrieval [2] I'm aware of CXL Compliance Testing [3], Integrity and Data Encryption (IDE) [4], and Component Measurement and Authentication (CMA) [5]. I've not read those, but I worry security_locked_down() may want to prevent even root userspace mucking with "security" interfaces. So that *might* be a reason to ensure exclusive kernel access beyond the basic sanity of the kernel being able to have uninterrupted request / response sessions with this mailbox [2]: https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf [3]: https://www.computeexpresslink.org/download-the-specification [4]: https://members.pcisig.com/wg/PCI-SIG/document/15149 [5]: https://members.pcisig.com/wg/PCI-SIG/document/14236