On Fri, May 28, 2021 at 1:58 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 27.05.21 23:30, Dan Williams wrote:
> > On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >>
> >> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> >>
> >> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> >>> Close the hole of holding a mapping over a kernel driver takeover
> >>> event of a given address range.
> >>>
> >>> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> >>> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> >>> kernel against scenarios where a /dev/mem user tramples memory that
> >>> a kernel driver owns. However, this protection only prevents *new*
> >>> read(), write() and mmap() requests. Mappings established before
> >>> the driver calls request_mem_region() are left alone.
> >>>
> >>> Especially with persistent memory, and the core kernel metadata that
> >>> is stored there, there are plentiful scenarios for a /dev/mem user to
> >>> violate the expectations of the driver and cause amplified damage.
> >>>
> >>> Teach request_mem_region() to find and shoot down active /dev/mem
> >>> mappings that it believes it has successfully claimed for the
> >>> exclusive use of the driver. Effectively a driver call to
> >>> request_mem_region() becomes a hole-punch on the /dev/mem device.
> >>
> >> This idea of hole-punching /dev/mem has since been extended to PCI
> >> BARs via [1].
> >>
> >> Correct me if I'm wrong: I think this means that if a user process has
> >> mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> >> that region via pci_request_region() or similar, we punch holes in the
> >> user process mmap. The driver might be happy, but my guess is the user
> >> starts seeing segmentation violations for no obvious reason and is not
> >> happy.
> >>
> >> Apart from the user process issue, the implementation of [1] is
> >> problematic for PCI because the mmappable sysfs attributes now depend
> >> on iomem_init_inode(), an fs_initcall, which means they can't be
> >> static attributes, which ultimately leads to races in creating them.
> >
> > See the comments in iomem_get_mapping() and revoke_iomem():
> >
> >   /*
> >    * Check that the initialization has completed. Losing the race
> >    * is ok because it means drivers are claiming resources before
> >    * the fs_initcall level of init and prevent iomem_get_mapping users
> >    * from establishing mappings.
> >    */
> >
> > ...the observation being that it is ok for the revocation inode to
> > come online later in the boot process because userspace won't be able
> > to use the fs yet. So any missed calls to revoke_iomem() would fall
> > back to userspace just seeing the resource busy in the first instance,
> > i.e. through the normal devmem_is_allowed() exclusion.
> >
> >>
> >> So I'm raising the question of whether this hole-punch is the right
> >> strategy.
> >>
> >>   - Prior to revoke_iomem(), __request_region() was very
> >>     self-contained and really only depended on the resource tree. Now
> >>     it depends on a lot of higher-level MM machinery to shoot down
> >>     mappings of other tasks. This adds quite a bit of complexity and
> >>     some new ordering constraints.
> >>
> >>   - Punching holes in the address space of an existing process seems
> >>     unfriendly. Maybe the driver's __request_region() should fail
> >>     instead, since the driver should be prepared to handle failure
> >>     there anyway.
> >
> > It's prepared to handle failure, but in this case it is dealing with a
> > root user of two minds.
> >
> >>
> >>   - [2] suggests that the hole punch protects drivers from /dev/mem
> >>     writers, especially with persistent memory. I'm not really
> >>     convinced. The hole punch does nothing to prevent a user process
> >>     from mmapping and corrupting something before the driver loads.
> >
> > The motivation for this was a case that was swapping between /dev/mem
> > access and /dev/pmem0 access, and they forgot to stop using /dev/mem
> > when they switched to /dev/pmem0. If root wants to use /dev/mem it can
> > use it, if root wants to stop the driver from loading it can set
> > modprobe policy or manually unbind, and if root asks the kernel to
> > load the driver while it is actively using /dev/mem something has to
> > give. Given that root has other options to stop a driver, the decision
> > to revoke userspace access when root messes up and causes a collision
> > seems prudent to me.
> >
>
> Is there a real use case for mapping pmem via /dev/mem, or could we
> just prohibit access to these areas completely?

The kernel offers several interfaces that can grant conflicting access
to iomem resources, and a long-standing mechanism
(CONFIG_IO_STRICT_DEVMEM) to enforce mutual exclusion between those
interfaces. That mechanism was found to be incomplete for the case
where a /dev/mem mapping is maintained after a kernel driver is
attached, and incomplete for other interfaces that map iomem, like
pci-sysfs. This was found with PMEM, but the issue is larger and
applies to userspace drivers / debug in general.

> What's the use case for "swapping between /dev/mem access and
> /dev/pmem0 access"?

"Who knows". I mean, I know in this case it was a platform validation
test using /dev/mem for "reasons", but I am not sure that is relevant
to the wider concern. With CONFIG_IO_STRICT_DEVMEM=n, exclusion is
enforced when drivers pass the IORESOURCE_EXCLUSIVE flag; with
CONFIG_IO_STRICT_DEVMEM=y, exclusion is enforced whenever the kernel
marks a resource IORESOURCE_BUSY; and if kernel lockdown is enabled,
the driver state is moot, as LOCKDOWN_DEV_MEM and LOCKDOWN_PCI_ACCESS
policy is in effect.
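
For illustration, a minimal sketch of a driver opting into that first
case (the driver name and addresses below are made up for the example;
request_mem_region_exclusive() is the existing ioport.h helper that
passes IORESOURCE_EXCLUSIVE to __request_region()):

#include <linux/module.h>
#include <linux/ioport.h>
#include <linux/io.h>

/* Hypothetical MMIO window, stand-ins for a real device's range */
#define EXAMPLE_BASE 0xfed40000UL
#define EXAMPLE_SIZE 0x1000UL

static void __iomem *example_regs;

static int __init example_init(void)
{
	struct resource *res;

	/*
	 * Claim the range with IORESOURCE_EXCLUSIVE set, so /dev/mem is
	 * excluded from it even with CONFIG_IO_STRICT_DEVMEM=n (with =y,
	 * the IORESOURCE_BUSY marking from any request_mem_region() call
	 * would already suffice).
	 */
	res = request_mem_region_exclusive(EXAMPLE_BASE, EXAMPLE_SIZE,
					   "example");
	if (!res)
		return -EBUSY;	/* range already claimed by someone else */

	example_regs = ioremap(EXAMPLE_BASE, EXAMPLE_SIZE);
	if (!example_regs) {
		release_mem_region(EXAMPLE_BASE, EXAMPLE_SIZE);
		return -ENOMEM;
	}
	return 0;
}

static void __exit example_exit(void)
{
	iounmap(example_regs);
	release_mem_region(EXAMPLE_BASE, EXAMPLE_SIZE);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");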