Re: Linux 5.15-rc2

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 20 Sep 2021 09:18:30 -0700

On Mon, Sep 20, 2021 at 6:44 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>
> Details for build failures below. Several improvements since last week,
> but it looks like the alpha related pci_iounmap patches still need some
> tweaking (see error log at the very end).

Bah.

I thought I had tested sparc64, but I was wrong.

Silly me, I had only tested the 32-bit case.

That sparc64 thing is being particularly stupid: sparc64 uses
GENERIC_PCI_IOMAP, but declares its own empty pci_iounmap() because it
didn't use GENERIC_IOMAP that does this all right.

And because it didn't use the generic iomap code, it's all actually
entirely buggy, in that it seems to think that pci_iounmap() is about
unmapping ports (like ioport_unmap) and thus a no-op. But no,
pci_iounmap() is supposed to unmap anything that pci_iomap() mapped,
which includes the actual MMIO range too.

Basically, the whole idea of "pci_iomap()" is that you give it a PCI
device and the index to the BAR in that device, and it maps that BAR -
whether it is MMIO or PIO. And then you can use that __iomem pointer
for ioread*() and friends (or you can use readl()/writel() if you know
it was MMIO).

You can give it a maximum length if you want, but by default it just
maps the whole PCI BAR, so the default usage would just be

    void __iomem *map = pci_iomap(pdev, bar, 0);

And then you do whatever IO using that 'map' base pointer, and once
you're done you do "pci_iounmap()" on it all.

And then the trick most cases use is that they know that the PIO case
is just always a fixed map, so for PIO that "map/unmap" part os a
no-op. But generally ONLY for the PIO case.

And the sparc64 code seems to think it's only used for PIO, and makes
pci_iounmap() a no-op in general. Which is all kinds of completely
broken.

This is the same bug that the broken inline function in
<asm-generic/io.h> had, and that I added a big comment about in commit
316e8d79a095 ("pci_iounmap'2: Electric Boogaloo: try to make sense of
it all"):

+ * This code is odd, and the ARCH_HAS/ARCH_WANTS #define logic comes
+ * from legacy <asm-generic/io.h> header file behavior. In particular,
+ * it would seem to make sense to do the iounmap(p) for the non-IO-space
+ * case here regardless, but that's not what the old header file code
+ * did. Probably incorrectly, but this is meant to be bug-for-bug
+ * compatible.

but I intentionally didn't fix the bug in that commit, because I
wanted to just try to keep the odd old logic as closely as possible.

It looks like a big part of the "people do their own pci_iounmap()"
thing is that they do it badly and with bugs.

This was all meant to uncover and fix warnings, but it seems to be
uncovering bigger issues.

Of course, most of the time the "pci_iounmap()" only happens at driver
unload time, so it's basically only a kernel virtual memory mapping
leak, which may be why people didn't realize how buggy their own
implementations were.

What the normal GENERIC_IOMAP code does is:

 - it "fake maps" the PIO space at an invalid fixed virtual address

   Since we know that a PIO address on PCI is just a 16-bit number,
this fake virtual window is small and easy to do:

        /*
         * We encode the physical PIO addresses (0-0xffff) into the
         * pointer by offsetting them with a constant (0x10000) and
         * assuming that all the low addresses are always PIO. That means
         * we can do some sanity checks on the low bits, and don't
         * need to just take things for granted.
         */
        #define PIO_OFFSET      0x10000UL
        #define PIO_MASK        0x0ffffUL
        #define PIO_RESERVED    0x40000UL

   so the logic is basically that we can trivially test whether a
"void __iomem *" pointer is a PIO pointer or not: if the pointer value
is in that range of PIO_OFFSET..PIO_OFFSET+PIO_MASK range, it's PIO,
otherwise it's mmio.

 - the MMIO space acts using all the normal ioremap() logic, and we
can tell the end result apart from PIO with the above trivial thing.

 - the GENERIC_IOMAP code internally just has a IO_COND(adds, is_pio,
is_mmio) helper macro, which sets "port" for the is_pio case, and
"addr" for the is_mmio case, so you can do trivial things like this:

        unsigned int ioread8(const void __iomem *addr)
        {
                IO_COND(addr, return inb(port), return readb(addr));
                return 0xff;
        }

   which does the "inb(port)" for the PIO case, and the "readb(addr)"
for the MMIO case.

 - and lookie here what the GENERIC_IOMAP code for pci_iounmap() is:

        void pci_iounmap(struct pci_dev *dev, void __iomem * addr)
        {
                IO_COND(addr, /* nothing */, iounmap(addr));
        }

  IOW, for the "is_pio" case it does nothing, and for the "is_mmio"
case it does "iounmap()".

So the GENERIC_IOMAP code is actually really simple and should just
work for pretty much everybody. All it requires is that fake kernel
virtual address range at PIO_OFFSET (you can override the default
values if you want - maybe your architecture really wants to put MMIO
in those virtual addresses, but I don't think there's a lot of reason
to generally want to do it)

But despite that, people think they should implement their own code,
and then they clearly get it HORRIBLY WRONG.

Anyway, this email ended up being a long explanation of what the code
_should_ do, in the hope that some enterprising kernel developer
decides "Oh, this sounds like an easy thing to fix". But you do need
to be able to test the end result at least a tiny bit.

Because I suspect that the real fix for sparc64 is to just get rid of
its broken non-GENERIC_IOMAP code, and just do "select GENERIC_IOMAP"

And I don't think sparc64 is the only architecture that should go "Oh,
I should just use GENERIC_IOMAP instead of implementing it badly by
hand".

Anyone?

               Linus