Re: [RFC v2 09/20] PCI/CMA: Expose in sysfs whether devices are authenticated

Lukas Wunner <lukas@xxxxxxxxx> · Sat, 1 Mar 2025 19:01:39 +0100

On Thu, Feb 27, 2025 at 05:39:53PM -0800, Greg KH wrote:
> On Thu, Feb 27, 2025 at 11:42:46PM +0100, Lukas Wunner wrote:
> > On Thu, Feb 27, 2025 at 03:16:40AM -0800, Greg KH wrote:
> > > I don't like this "if it's present we still don't know if the device
> > > supports this", as that is not normally the "sysfs way" here.  Why must
> > > it be present in those situations?
> > 
> > That's explained above.
> 
> Not really, you just say "downgrade attacks", which is not something
> that we need to worry about, right?

A downgrade attack means duping the victim into believing that only
a weaker security mode is supported.  E.g. only sha1, but not sha256.

In this context, downgrade attack means duping the kernel or user
into believing that SPDM authentication is unsupported, even though it is.

https://en.wikipedia.org/wiki/Downgrade_attack

That's definitely something we need to be aware of and guard against,
otherwise what's the point of authenticating in the first place.

> > Unfortunately there is no (signed) bit in Config Space which tells us
> > whether authentication is supported by a PCI device.  Rather, it is
> > necessary to exchange several messages with the device through a
> > DOE mailbox in config space to determine that.  I'm worried that an
> > attacker deliberately "glitches" those DOE exchanges and thus creates
> > the appearance that the device does not support authentication.
> 
> That's a hardware glitch, and if that happens, then it will show a 0 and
> that's the same as not being present at all, right?

No, the "authenticated" attribute is not present in sysfs if authentication
is unsupported.

The downgrade attack protection comprises exposing the attribute if it
could not be determined whether authentication is supported or not,
and returning an error (ENOTTY) on read or write.

User space applications need to check anyway whether read() or write()
failed for some reason.  E.g. if the device is hot-removed concurrently,
the read() system call returns ENODEV.  So returning ENOTTY is just
another error that can occur on access to the attribute.

The idea is that user space typically wants to check whether the attribute
contains "1", signifying that the device was authenticated successfully.
Hence a return value of "0" or any error code signifies that the device
is not authenticated.

And if user space wants to check whether authentication is supported at all,
it checks for the presence of the sysfs attribute.  Hence exposing the
attribute even if support could not be determined is a safety net: it keeps
user space from wrongly concluding that the device does not support
authentication.
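
To make the semantics above concrete, here is a minimal user space sketch.
It is an illustration only, not part of the patch set: the function name
auth_state() and the state strings are made up, and the attribute path is
whatever the caller passes in.

```python
import errno


def auth_state(attr_path):
    """Classify the "authenticated" sysfs attribute of one device.

    Returns "unsupported" (attribute absent), "undetermined" (ENOTTY),
    "authenticated" (reads "1"), "unauthenticated" (reads anything else),
    or "error" for any other failure such as ENODEV on hot-removal.
    The names are hypothetical, chosen for this sketch.
    """
    try:
        with open(attr_path) as f:
            value = f.read().strip()
    except FileNotFoundError:
        # Attribute not exposed: authentication is unsupported.
        return "unsupported"
    except OSError as e:
        if e.errno == errno.ENOTTY:
            # Attribute exposed, but support could not be determined.
            return "undetermined"
        return "error"
    return "authenticated" if value == "1" else "unauthenticated"
```

Note that a caller which only asks "is it authenticated?" can treat every
result other than "authenticated" the same way.  The "undetermined" state
matters only to a caller that would otherwise treat absence of the attribute
as "legacy device, nothing to check".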

For PCIe, glitching the hardware (the electric signals exchanged with
the device) is indeed one way to disrupt the DOE and SPDM exchanges.

However the SPDM protocol has not only been adopted by PCIe, but also
other buses, in particular SCSI and ATA.  And in those cases, glitching
the SPDM exchanges may be a pure software thing.  (Think iSCSI communication
with storage devices in a remote rack or data center.)

Damien Le Moal has explicitly requested that the user space ABI for SPDM
be consistent across buses.  So the downgrade attack protection can be
taken advantage of by those other buses as well.

> > Let's say the user's policy is to trust legacy devices which do not
> > support authentication, but require authentication for newer NVMe drives
> > from a certain vendor.  An attacker may manipulate an authentication-capable
> > NVMe drive from that vendor, whereupon it will fail authentication.
> > But the attacker can trick the user into trusting the device by glitching
> > the DOE exchanges.
> 
> Again, are we now claiming that Linux needs to support "hardware
> glitching"?  Is that required somewhere?

Required?  It's simply prudent to protect users from being duped into
thinking the device doesn't support authentication.

> I think if the DOE exchanges
> fail, we just trust the device as we have to trust something, right?

If the DOE exchanges fail, something fishy is going on.
Why should we hide that fact from the user?
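
As a rough sketch of the policy from the example quoted above (trust legacy
devices, require authentication for certain drives), user space might decide
as follows.  The function trust_device(), its parameters and the state
strings are assumptions made for illustration; nothing like this exists in
the series.

```python
def trust_device(state, requires_auth):
    """Decide whether to trust a device under the example policy.

    state: "unsupported", "undetermined", "authenticated" or
           "unauthenticated" (hypothetical names for this sketch).
    requires_auth: True if the policy demands authentication for this
                   device, e.g. NVMe drives from a certain vendor.
    """
    if state == "authenticated":
        return True
    if state == "unsupported":
        # Legacy device: trusted unless the policy demands authentication.
        return not requires_auth
    # "unauthenticated" or "undetermined": never fall back to legacy trust.
    return False
```

If failed DOE exchanges were reported the same way as "unsupported", the
second branch would be reachable for a manipulated device whenever the
policy cannot tell by other means that authentication is required.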

> > The device needs to be re-enumerated by the PCI core to retry
> > determining its authentication capability.  That's why the
> > sysfs documentation says the user may exercise the "remove"
> > and "rescan" attributes to retry authentication.
> 
> But how does it know that?

Because reads and writes to the attribute return ENOTTY.

> remove and rescan is a huge sledgehammer, and
> an amazing one if it even works on most hardware.  Don't make it part of
> any normal process please.

It's not a normal process.  It's manual recovery in case of a
potential attack.  The user can also choose to unplug the device
or reboot the machine.  That's arguably a bigger sledgehammer.
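
For reference, the manual recovery amounts to roughly the following sketch.
It assumes the standard PCI sysfs attributes /sys/bus/pci/devices/<BDF>/remove
and /sys/bus/pci/rescan; the helper name reenumerate() and the sysfs_root
parameter (there only so the code can be tried against a scratch directory)
are made up for illustration.

```python
import os


def reenumerate(bdf, sysfs_root="/sys"):
    """Remove one PCI device, then rescan all PCI buses.

    bdf is the device address, e.g. "0000:01:00.0".  On a real system
    this needs root and detaches the device's driver, so it is a manual
    recovery step, not something to run routinely.
    """
    remove = os.path.join(sysfs_root, "bus/pci/devices", bdf, "remove")
    rescan = os.path.join(sysfs_root, "bus/pci/rescan")
    with open(remove, "w") as f:
        f.write("1")
    with open(rescan, "w") as f:
        f.write("1")
```

After the rescan, the "authenticated" attribute of the re-enumerated device
can be read again to see whether support could be determined this time.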

> It's the error, don't do that.  If an error is going to happen, then
> don't have the file there.  That's the way sysfs works, it's not a
> "let's add all possible files and then make userspace open them all and
> see if an error happens to determine what really is present for this
> device" model.  It's a "if a file is there, that attribute is there and
> we can read it".

The point is that if the file isn't there even though the device might
support authentication, we're creating a false and dangerous illusion.
This is different from other attributes which don't have that quality.

> > > > Alternatively, authentication success might be signaled to user space
> > > > through a uevent, whereupon it may bind a (blacklisted) driver.
> > > 
> > > How will that happen?
> > 
> > The SPDM library can be amended to signal a uevent when authentication
> > succeeds or fails and user space can then act on it.  I imagine systemd
> > or some other daemon might listen to such events and do interesting things,
> > such as binding a driver once authentication succeeds.
> 
> That's a new user/kernel api and should be designed ONLY if you actually
> need it and have a user.  Otherwise let's just wait for later for that.

Of course.  Again, the commit message makes suggestions for future
extensions to justify the change.  Those are just ideas.  Whether
and how they are implemented remains to be seen.  Signaling a uevent
on authentication success or failure seems like an obvious idea,
hence I included it in the commit message.

I fear if I don't include those ideas in the commit message, someone
will come along and ask "why do you need this at all?", thus putting
into question the whole set of authentication patches.

> > > If an attacker can consume kernel memory to cause this to happen you
> > > have bigger problems.  That's not the kernel's issue here at all.
> > > 
> > > And "disable communication" means "we just don't support it as the
> > > device doesn't say it does", so again, why does that matter?
> > 
> > Reacting to potential attacks sure is the kernel's business.
> 
> Reacting to real, software attacks is the kernel's business.  Reacting
> to possible hardware issues that are just theoretical is not.

We have a fundamental disagreement about whether certain attacks need to be
taken seriously.  Which reminds me of...

   "the final topic on the agenda was the corporate attempt at
    security consciousness raising; a shouting match ensued,
    in the course of which several and various reputations
    were sullied, certain paranoid reactions were taken less
    than seriously, and no great meeting of the minds was met."

   [minutes of the uucp-lovers interest group, 20 April 1983,
    from "A Quarter Century of UNIX" (1994) page 113]
    https://wiki.tuhs.org/lib/exe/fetch.php?media=publications:qcu.pdf

Looks like we're just upholding the time-honored tradition of
UNIX security disagreements!

Thanks,

Lukas