Re: [PATCH v4] /dev/mem: Revoke mappings when a driver claims the region

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jun 2, 2021 at 8:40 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> [+cc Pali, Oliver]
>
> On Thu, May 27, 2021 at 02:30:31PM -0700, Dan Williams wrote:
> > On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > >
> > > [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> > >
> > > On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > > > Close the hole of holding a mapping over kernel driver takeover event of
> > > > a given address range.
> > > >
> > > > Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> > > > kernel against scenarios where a /dev/mem user tramples memory that a
> > > > kernel driver owns. However, this protection only prevents *new* read(),
> > > > write() and mmap() requests. Established mappings prior to the driver
> > > > calling request_mem_region() are left alone.
> > > >
> > > > Especially with persistent memory, and the core kernel metadata that is
> > > > stored there, there are plentiful scenarios for a /dev/mem user to
> > > > violate the expectations of the driver and cause amplified damage.
> > > >
> > > > Teach request_mem_region() to find and shoot down active /dev/mem
> > > > mappings that it believes it has successfully claimed for the exclusive
> > > > use of the driver. Effectively a driver call to request_mem_region()
> > > > becomes a hole-punch on the /dev/mem device.
> > >
> > > This idea of hole-punching /dev/mem has since been extended to PCI
> > > BARs via [1].
> > >
> > > Correct me if I'm wrong: I think this means that if a user process has
> > > mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> > > that region via pci_request_region() or similar, we punch holes in the
> > > the user process mmap.  The driver might be happy, but my guess is the
> > > user starts seeing segmentation violations for no obvious reason and
> > > is not happy.
> > >
> > > Apart from the user process issue, the implementation of [1] is
> > > problematic for PCI because the mmappable sysfs attributes now depend
> > > on iomem_init_inode(), an fs_initcall, which means they can't be
> > > static attributes, which ultimately leads to races in creating them.
> >
> > See the comments in iomem_get_mapping(), and revoke_iomem():
> >
> >         /*
> >          * Check that the initialization has completed. Losing the race
> >          * is ok because it means drivers are claiming resources before
> >          * the fs_initcall level of init and prevent iomem_get_mapping users
> >          * from establishing mappings.
> >          */
> >
> > ...the observation being that it is ok for the revocation inode to
> > come on later in the boot process because userspace won't be able to
> > use the fs yet. So any missed calls to revoke_iomem() would fall back
> > to userspace just seeing the resource busy in the first instance. I.e.
> > through the normal devmem_is_allowed() exclusion.
>
> I did see that comment, but the race I meant is different.  Pali wrote
> up a nice analysis of it [3].
>
> Here's the typical enumeration flow for PCI:
>
>   acpi_pci_root_add                 <-- subsys_initcall (4)
>     pci_acpi_scan_root
>       ...
>         pci_device_add
>           device_initialize
>           device_add
>             device_add_attrs        <-- static sysfs attributes created
>     ...
>     pci_bus_add_devices
>       pci_bus_add_device
>         pci_create_sysfs_dev_files
>           if (!sysfs_initialized) return;    <-- Ugh :)
>           ...
>             attr->mmap = pci_mmap_resource_uc
>             attr->mapping = iomem_get_mapping()  <-- new dependency
>               return iomem_inode->i_mapping
>             sysfs_create_bin_file   <-- dynamic sysfs attributes created
>
>   iomem_init_inode                  <-- fs_initcall (5)
>     iomem_inode = ...               <-- now iomem_get_mapping() works
>
>   pci_sysfs_init                    <-- late_initcall (7)
>     sysfs_initialized = 1           <-- Ugh (see above)
>     for_each_pci_dev(dev)           <-- Ugh
>       pci_create_sysfs_dev_files(dev)
>
> The race is between the pci_sysfs_init() initcall (intended for
> boot-time devices) and the pci_bus_add_device() path (used for all
> devices including hot-added ones).  Pali outlined cases where we call
> pci_create_sysfs_dev_files() from both paths for the same device.
>
> "sysfs_initialized" is a gross hack that prevents this most of the
> time, but not always.  I want to get rid of it and pci_sysfs_init().
>
> Oliver had the excellent idea of using static sysfs attributes to do
> this cleanly [4].  If we can convert things to static attributes, the
> device core creates them in device_add(), so we don't have to create
> them in pci_create_sysfs_dev_files().
>
> Krzysztof recently did some very nice work to convert most things to
> static attributes, e.g., [5].  But we can't do this for the PCI BAR
> attributes because they support ->mmap(), which now depends on
> iomem_get_mapping(), which IIUC doesn't work until after fs_initcalls.

Ah, sorry, yes, I see the race now. And yes, anything that gets in the
way of the static attribute conversion needs fixing. How about
something like this?

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index beb8d1f4fafe..c8bc249750d6 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1195,7 +1195,7 @@ static int pci_create_attr(struct pci_dev *pdev,
int num, int write_combine)
                }
        }
        if (res_attr->mmap)
-               res_attr->mapping = iomem_get_mapping();
+               res_attr->mapping = iomem_get_mapping;
        res_attr->attr.name = res_attr_name;
        res_attr->attr.mode = 0600;
        res_attr->size = pci_resource_len(pdev, num);
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..a3ee4c32a264 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -175,7 +175,7 @@ static int sysfs_kf_bin_open(struct kernfs_open_file *of)
        struct bin_attribute *battr = of->kn->priv;

        if (battr->mapping)
-               of->file->f_mapping = battr->mapping;
+               of->file->f_mapping = battr->mapping();

        return 0;
 }
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index d76a1ddf83a3..fbb7c7df545c 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -170,7 +170,7 @@ struct bin_attribute {
        struct attribute        attr;
        size_t                  size;
        void                    *private;
-       struct address_space    *mapping;
+       struct address_space *(*mapping)(void);
        ssize_t (*read)(struct file *, struct kobject *, struct bin_attribute *,
                        char *, loff_t, size_t);
        ssize_t (*write)(struct file *, struct kobject *, struct
bin_attribute *,



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux