Re: [RFC PATCH 07/21] pci/tdisp: Introduce tsm module

Dan Williams <dan.j.williams@xxxxxxxxx> · Wed, 4 Sep 2024 16:28:45 -0700

Alexey Kardashevskiy wrote:
> 
> 
> On 4/9/24 09:51, Dan Williams wrote:
> > Alexey Kardashevskiy wrote:
> >> The module responsibilities are:
> >> 1. detect TEE support in a device and create nodes in the device's sysfs
> >> entry;
> >> 2. allow binding a PCI device to a VM for passing it through in a trusted
> >> manner;
> >> 3. store measurements/certificates/reports and provide access to those for
> >> the userspace via sysfs.
> >>
> >> This relies on the platform to register a set of callbacks,
> >> for both host and guest.
> >>
> >> And tdi_enabled in the device struct.
> > 
> > I had been holding out hope that when I got this patch the changelog
> > would give some justification for what folks had been whispering to me
> > in recent days: "hey Dan, looks like Alexey is completely ignoring the
> > PCI/TSM approach?".
> > 
> > Bjorn acked that approach here:
> > 
> > http://lore.kernel.org/20240419220729.GA307280@bhelgaas
> > 
> > It is in need of a refresh, preview here:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux.git/commit/?id=5807465b92ac
> > 
> > At best, I am disappointed that this RFC ignored it. More comments
> > below, but please do clarify if we are working together on a Bjorn-acked
> > direction, or not.
> 
> Together.
> 
> My problem with that patchset is that it only does connect/disconnect 
> and no TDISP business (and I need both for my exercise) and I was hoping 
> to see some TDISP-aware git tree but this has not happened yet so I 
> postponed rebasing onto it, due to the lack of time and also apparent 
> difference between yours and mine TSMs (and I had mine working before I 
> saw yours and focused on making things work for the starter). Sorry, I 
> should have spoken louder. Or listen better to that whispering. Or 
> rebase earlier.

Ok, this makes sense. This is definitely changelog material to clarify
assumptions, tradeoffs, and direction. The fact that the changelog said
nothing about those was, at a minimum, cause for concern.

[..]
> >> @@ -801,6 +802,7 @@ struct device {
> >>   	void	(*release)(struct device *dev);
> >>   	struct iommu_group	*iommu_group;
> >>   	struct dev_iommu	*iommu;
> >> +	struct tsm_tdi		*tdi;
> > 
> > No. The only known device model for TDIs is PCI devices, i.e. TDISP is a
> > PCI protocol. Even SPDM which is cross device-type generic did not touch
> > 'struct device'.
> 
> TDISP is PCI but DMA is not. This is for:
> [RFC PATCH 19/21] sev-guest: Stop changing encrypted page state for 
> TDISP devices
> 
> DMA layer deals with struct device and tries hard to avoid indirect _ops 
> calls so I was looking for a place for "tdi_enabled" (a bad name, 
> perhaps, may be call it "dma_encrypted", a few lines below).

The name, and the fact that it exposes all of the TSM interfaces to the
driver core, made it unclear whether this oversharing was on purpose or
for convenience / expediency.

I agree that 'struct device' should carry DMA mapping details, but the
full TDI context is so much more than that, which makes it difficult to
understand the organizing principle of this data sharing.

> the flag and the pointer together for the RFC. I am hoping for a better 
> solution for 19/21, then I am absolutely moving tdi* to pci_dev (well, 
> drop these and just use yours).

Ok, so what patches are in the category of "temporary hacks to get
something going and a plan to replace them", and which are "firm
proposals looking for review feedback"?

[..]
> >> +/**
> >> + * struct tdisp_interface_id - TDISP INTERFACE_ID Definition
> >> + *
> >> + * @function_id: Identifies the function of the device hosting the TDI
> >> + * 15:0: @rid: Requester ID
> >> + * 23:16: @rseg: Requester Segment (Reserved if Requester Segment Valid is Clear)
> >> + * 24: @rseg_valid: Requester Segment Valid
> >> + * 31:25 – Reserved
> >> + * 8B - Reserved
> >> + */
> >> +struct tdisp_interface_id {
> >> +	union {
> >> +		struct {
> >> +			u32 function_id;
> >> +			u8 reserved[8];
> >> +		};
> >> +		struct {
> >> +			u16 rid;
> >> +			u8 rseg;
> >> +			u8 rseg_valid:1;
> > 
> > Linux typically avoids C-bitfields in hardware interfaces in favor of
> > bitfield.h macros.
> > >> +		};
> >> +	};
> >> +} __packed;
> > 
> > Does this need to be "packed"? Looks naturally aligned to pahole.
> 
> "__packed" is also a way to say it is a binary interface, I want to be 
> precise about this.

It's also a way to tell the compiler to turn off useful optimizations.

Don't these also need to be __le32 and __le16 for the multi-byte fields?
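
To make the bitfield.h and endianness points concrete, here is a rough
userspace sketch of the same INTERFACE_ID layout built from masks on a
little-endian 32-bit word instead of C-bitfields. The mask names and the
*_sketch() helpers are made up for illustration; kernel code would use
__le32, GENMASK(), FIELD_GET()/FIELD_PREP() and cpu_to_le32() /
le32_to_cpu() instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mask names; the kernel would build these with GENMASK()/BIT(). */
#define TDISP_FID_RID_MASK	0x0000ffffu	/* bits 15:0, Requester ID */
#define TDISP_FID_RSEG_MASK	0x00ff0000u	/* bits 23:16, Requester Segment */
#define TDISP_FID_RSEG_VALID	0x01000000u	/* bit 24, Requester Segment Valid */
#define TDISP_FID_RSEG_SHIFT	16

struct tdisp_interface_id {
	uint32_t function_id;	/* stored little-endian; __le32 in kernel code */
	uint8_t reserved[8];
};

/* Naturally aligned: 4 + 8 bytes with no padding, so no __packed is needed. */
static_assert(sizeof(struct tdisp_interface_id) == 12, "unexpected padding");

/* Portable stand-in for cpu_to_le32(). */
static uint32_t cpu_to_le32_sketch(uint32_t v)
{
	uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8),
			 (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
	uint32_t le;

	memcpy(&le, b, sizeof(le));
	return le;
}

/* Portable stand-in for le32_to_cpu(). */
static uint32_t le32_to_cpu_sketch(uint32_t le)
{
	uint8_t b[4];

	memcpy(b, &le, sizeof(b));
	return b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static struct tdisp_interface_id tdisp_id_make(uint16_t rid, uint8_t rseg,
					       int rseg_valid)
{
	struct tdisp_interface_id id = { 0 };
	uint32_t v = rid & TDISP_FID_RID_MASK;

	/* The segment field is reserved unless the valid bit is set. */
	if (rseg_valid) {
		v |= ((uint32_t)rseg << TDISP_FID_RSEG_SHIFT) & TDISP_FID_RSEG_MASK;
		v |= TDISP_FID_RSEG_VALID;
	}
	id.function_id = cpu_to_le32_sketch(v);
	return id;
}

static uint16_t tdisp_id_rid(const struct tdisp_interface_id *id)
{
	return le32_to_cpu_sketch(id->function_id) & TDISP_FID_RID_MASK;
}

static uint8_t tdisp_id_rseg(const struct tdisp_interface_id *id)
{
	return (le32_to_cpu_sketch(id->function_id) & TDISP_FID_RSEG_MASK) >>
	       TDISP_FID_RSEG_SHIFT;
}

static int tdisp_id_rseg_valid(const struct tdisp_interface_id *id)
{
	return !!(le32_to_cpu_sketch(id->function_id) & TDISP_FID_RSEG_VALID);
}
```

The in-memory byte order comes out the same on little- and big-endian
hosts, which is the property the endian annotations are meant to
document and let sparse check.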

[..]
> > Same C-bitfield comment, as before, and what about big endian hosts?
> 
> Right, I'll get rid of c-bitfields in the common parts.
> 
> Although I am curious what big-endian platform is going to actually 
> support this.

The PCI DOE and CMA code is cross-CPU generic with endian annotations
where needed. Why would PCI TSM code get away with kicking that analysis
down the road?

[..]
> >> +/* Physical device descriptor responsible for IDE/TDISP setup */
> >> +struct tsm_dev {
> >> +	struct kref kref;
> > 
> > Another kref that begs the question why would a tsm_dev need its own
> > lifetime? This also goes back to the organization in the PCI/TSM
> > proposal that all TSM objects are at max bound to the lifetime of
> > whatever is shorter, the registration of the low-level TSM driver or the
> > PCI device itself.
> 
> 
> That proposal deals with PFs for now and skips TDIs. Since TDI needs its 
> place in pci_dev too, and I wanted to add the bare minimum to struct 
> device or pci_dev, I only add TDIs and each of them references a DEV. 
> Enough to get me going.

Fine for an RFC, but again please be upfront about what is firmer for
deeper scrutiny and what is softer to get the RFC standing.

> >> +	const struct attribute_group *ag;
> > 
> > PCI device attribute groups are already conveyed in a well known
> > (lifetime and user visibility) manner. What is motivating this
> > "re-imagining"?
> > 
> >> +	struct pci_dev *pdev; /* Physical PCI function #0 */
> >> +	struct tsm_spdm spdm;
> >> +	struct mutex spdm_mutex;
> > 
> > Is an spdm lock sufficient? I expect the device needs to serialize all
> > TSM communications, not just spdm? Documentation of the locking would
> > help.
> 
> What other communication do you mean here?

For example, a lock protecting entry into tsm_ops->connect(...). If that
operation is already serialized, does there need to be a lower-level
spdm locking context?

[..]
> >> +/*
> >> + * Enables IDE between the RC and the device.
> >> + * TEE Limited, IDE Cfg space and other bits are hardcoded
> >> + * as this is a sketch.
> > 
> > It would help to know how in depth to review the pieces if there were
> > more pointers of "this is serious proposal", and "this is a sketch".
> 
> Largely the latter, remember to keep appreciating the "release early" 
> aspect of it :)
> 
> It is a sketch which has been tested on the hardware with both KVM and 
> SNP VM which (I thought) has some value if posted before the LPC. I 
> should have made it clearer though.

It is definitely useful for getting the conversation started, but maybe
we need a SubmittingPatches style document that clarifies that RFCs
need to be explicit about if and where reviewers should spend their time.

[..]
> > This feels kludgy. IDE is a fundamental mechanism of a PCI device why
> > would a PCI core helper not know how to extract the settings from a
> > pdev?
> > 
> > Something like:
> > 
> > pci_ide_setup_stream(pdev, i)
> 
> 
> It is unclear to me how we go about what stream(s) need(s) enabling and 
> what flags to set. Who decides - a driver? a daemon/user?

That is a good topic for the design document that Jason wanted. I had
been expecting that since stream IDs are a limited resource the kernel
needs to depend on userspace to handle allocation conflicts. Most of the
other settings would seem to be PCI core defaults unless and until
someone can point to a use case for a driver or userspace to have a
different opinion about those settings.

> >> +		if (ret) {
> >> +			pci_warn(tdev->pdev,
> >> +				 "Failed configuring SelectiveIDE#%d with %d\n",
> >> +				 i, ret);
> >> +			break;
> >> +		}
> >> +
> >> +		ret = pci_ide_set_sel_rid_assoc(rootport, i, true, 0, 0, 0xFFFF);
> >> +		if (ret)
> >> +			pci_warn(rootport,
> >> +				 "Failed configuring SelectiveIDE#%d rid1 with %d\n",
> >> +				 i, ret);
> >> +
> >> +		ret = pci_ide_set_sel(rootport, i,
> > 
> > Perhaps:
> > 
> > pci_ide_host_setup_stream(pdev, i)
> > 
> > ...I expect the helper should be able to figure out the rootport and RID
> > association.
> 
> Where will the helper get the properties from?

I expect it can retrieve it out of @pdev since the IDE settings belong
in 'struct pci_dev'.
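
As a rough illustration of "the helper figures it out from @pdev", here
is a toy userspace model. 'struct pdev_model', 'struct
ide_stream_setup' and ide_host_setup_stream_sketch() are all made up for
this sketch, and associating the stream with just the endpoint's own RID
is an assumption, not something the RFC or the PCI/TSM series specifies.
In the kernel the equivalent lookups would presumably go through
pcie_find_root_port() and pci_dev_id().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for 'struct pci_dev'; every field here is illustrative. */
struct pdev_model {
	struct pdev_model *parent;	/* upstream bridge, NULL above a root port */
	uint8_t bus;
	uint8_t devfn;
	bool is_root_port;
};

/* Hypothetical result of resolving a stream's host-side settings. */
struct ide_stream_setup {
	struct pdev_model *rootport;
	uint16_t rid;
};

/* Requester ID: bus number in bits 15:8, device/function in bits 7:0. */
static uint16_t pdev_rid(const struct pdev_model *pdev)
{
	return (uint16_t)((pdev->bus << 8) | pdev->devfn);
}

/* Walk upstream until a root port is found, or return NULL. */
static struct pdev_model *pdev_find_root_port(struct pdev_model *pdev)
{
	while (pdev && !pdev->is_root_port)
		pdev = pdev->parent;
	return pdev;
}

/* Derive everything from the endpoint; the caller passes no rootport or RID. */
static int ide_host_setup_stream_sketch(struct pdev_model *pdev,
					struct ide_stream_setup *out)
{
	struct pdev_model *rp = pdev_find_root_port(pdev);

	if (!rp)
		return -1;
	out->rootport = rp;
	out->rid = pdev_rid(pdev);
	return 0;
}
```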

[..]
> >> +static int tsm_dev_reclaim(struct tsm_dev *tdev, void *private_data)
> >> +{
> >> +	struct pci_dev *pdev = NULL;
> >> +	int ret;
> >> +
> >> +	if (WARN_ON(!tsm.ops->dev_reclaim))
> >> +		return -EPERM;
> > 
> > Similar comment about how this could happen and why crashing the kernel
> > is ok.
> 
> In this exercise, connect/reclaim are triggered via sysfs so this can 
> happen in my practice.
> 
> And it is WARN_ON, not BUG_ON, is it still called "crashing" (vs. 
> "panic", I never closely thought about it)?

You will see folks like Greg raise the concern that many users run with
"panic_on_warn" enabled. I expect a confidential VM is well advised to
enable that.

If it is a "can't ever happen outside of a kernel developer mistake"
then maybe WARN_ON() is ok, and you will see folks like Christoph assert
that WARN_ON() is good for that, but it should be reserved for cases
where rebooting might be a good idea if it fires.
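
For the sysfs-reachable case, a minimal sketch of failing quietly
instead of warning could look like the following. The struct and
function names are hypothetical, and -EOPNOTSUPP is only one plausible
choice of error code (the RFC returns -EPERM).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the platform callback table. */
struct tsm_ops_sketch {
	int (*dev_reclaim)(void *tdev);
};

/* Example callback that always succeeds, used only for illustration. */
static int dev_reclaim_ok(void *tdev)
{
	(void)tdev;
	return 0;
}

static int tsm_dev_reclaim_sketch(const struct tsm_ops_sketch *ops, void *tdev)
{
	/*
	 * Userspace can reach this via sysfs, so a missing callback is an
	 * ordinary error rather than a kernel bug: no WARN_ON(), nothing
	 * for panic_on_warn to trip over.
	 */
	if (!ops || !ops->dev_reclaim)
		return -EOPNOTSUPP;

	return ops->dev_reclaim(tdev);
}
```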

> >> +
> >> +	/* Do not disconnect with active TDIs */
> >> +	for_each_pci_dev(pdev) {
> >> +		struct tsm_tdi *tdi = tsm_tdi_get(&pdev->dev);
> >> +
> >> +		if (tdi && tdi->tdev == tdev && tdi->data)
> >> +			return -EBUSY;
> > 
> > I would expect that removing things out of order causes violence, not
> > blocking it.
> > 
> > For example you can remove disk drivers while filesystems are still
> > mounted. What is the administrator's recourse if they *do* want to
> > shutdown the TSM layer all at once?
> 
> "rmmod tsm"

Is tsm_dev_reclaim() triggered by "rmmod tsm"? The concern is how to
reclaim when tsm_dev_reclaim() sometimes returns -EBUSY. Just as the
driver core enforces that driver unbind must succeed, so should TSM
shutdown.

Also, the proposal Bjorn acked, because it comports with PCI sysfs
lifetime and visibility expectations, is that the TSM core is part of
the PCI core, just like DOE and CMA. The proposed way to shutdown TSM
operations is to unbind the low level TSM driver (TIO, TDX-Connect,
etc...) and that will forcefully destruct all TDI contexts with no
dangling -EBUSY cases.

Maybe tsm_dev_reclaim() is not triggered by TSM shutdown, but TSM
shutdown, like 'struct device_driver'.remove(), should return 'void'.
Note, I know that 'struct device_driver' is not quite there yet on
->remove() returning 'void' instead of 'int', but that is the direction.
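
A toy model of that "cannot fail" shape, with all names invented for
illustration: shutdown returns void and tears down every TDI context it
finds rather than bailing out with -EBUSY while some are still bound.

```c
#include <stdlib.h>

/* Illustrative TDI context; 'bound' stands in for an active binding. */
struct tdi_sketch {
	struct tdi_sketch *next;
	int bound;
};

/* Illustrative TSM instance tracking its TDI contexts. */
struct tsm_sketch {
	struct tdi_sketch *tdis;
	int connected;
};

/* Helper for the example: add one bound TDI, returns 0 on success. */
static int tsm_add_tdi_sketch(struct tsm_sketch *tsm)
{
	struct tdi_sketch *tdi = calloc(1, sizeof(*tdi));

	if (!tdi)
		return -1;
	tdi->bound = 1;
	tdi->next = tsm->tdis;
	tsm->tdis = tdi;
	tsm->connected = 1;
	return 0;
}

/* Returns void: active TDIs are forcefully unbound, never a reason to refuse. */
static void tsm_shutdown_sketch(struct tsm_sketch *tsm)
{
	struct tdi_sketch *tdi = tsm->tdis;

	while (tdi) {
		struct tdi_sketch *next = tdi->next;

		tdi->bound = 0;		/* unbind first ... */
		free(tdi);		/* ... then destroy the context */
		tdi = next;
	}
	tsm->tdis = NULL;
	tsm->connected = 0;
}
```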

[..]
> > Why is refresh not "connect"? I.e. connecting an already connected
> > device refreshes the connection.
> 
> Really not sure about that. Either way I am ditching it for now.

Yeah, let's aggressively defer incremental features.

> >> +		ret = spdm_forward(&tdev->spdm, ret);
> >> +		if (ret < 0)
> >> +			break;
> >> +	}
> >> +	mutex_unlock(&tdev->spdm_mutex);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static void tsm_tdi_reclaim(struct tsm_tdi *tdi, void *private_data)
> >> +{
> >> +	int ret;
> >> +
> >> +	if (WARN_ON(!tsm.ops->tdi_reclaim))
> >> +		return;
> >> +
> >> +	mutex_lock(&tdi->tdev->spdm_mutex);
> >> +	while (1) {
> >> +		ret = tsm.ops->tdi_reclaim(tdi, private_data);
> >> +		if (ret <= 0)
> >> +			break;
> > 
> > What is involved in tdi "reclaim" separately from "unbind"?
> > "dev_reclaim" and "tdi_reclaim" seem less precise than "disconnect" and
> > "unbind".
> 
> The firmware operates at the finer granularity so there are 
> create+connect+disconnect+reclaim (for DEV and TDI). My verbs dictionary 
> evolved from having all of them in the tsm_ops to this subset which 
> tells the state the verb leaves the device at. This needs correction, yes.

I like the simplicity of the TIO verbs, but that does not preclude the
Linux verbs from having even coarser semantics.

[..]
> >> +/* In case BUS_NOTIFY_PCI_BUS_MASTER is no good, a driver can call pci_dev_tdi_validate() */
> > 
> > No. TDISP is a fundamental re-imagining of the PCI device security
> > model. It deserves first class support in the PCI core, not bolted on
> > support via bus notifiers.
> 
> This one is about sequencing. For example, writing a zero to BME breaks 
> a TDI after it moved to CONFIG_LOCKED. So, we either:
> 1) prevent zeroing BME or
> 2) delay this "validation" step (which also needs a better name).
> 
> If 1), then I can call "validate" from the PCI core before the driver's 
> probe.
> If 2), it is either a driver modification to call "validate" explicitly 
> or have a notifier like this. Or guest's sysfs - as a VM might want to 
> boot with a "shared" device, get to the userspace where some daemon 
> inspects the certificates/etc and "validates" the device only if it is 
> happy with the result. There may be even some vendor-specific device 
> configuration happening before the validation step.

Right, the guest might need to operate the device in shared mode to get
it ready for validation. At that point locking and validating the device
needs to be triggered by userspace talking to the PCI core before
reloading the driver to operate the device in private mode. That
conversion is probably best modeled as a hotplug event to leave the
shared world and enter the secured world.

That likely means that the userspace operation to transition the device
to LOCKED also needs to take care of enabling BME and MSE independent of
any driver just based on the interface report. Then, loading the driver
can take the device from LOCKED to RUN when ready.

Yes, that implies an enlightened driver, for simplicity. We could later 
think about auto-validating devices by pre-loading golden measurements
into the kernel, but I expect the common case is that userspace needs to
do a bunch of work with the device-evidence and the verifier to get
itself comfortable with allowing the device to transition to the RUN
state.
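
The flow above can be sketched as a small state model. This is only an
illustration: the names are invented, the 'accepted' flag is an assumed
way to record that userspace finished its verifier work, and the error
codes are arbitrary.

```c
#include <errno.h>
#include <stdbool.h>

enum tdi_state_sketch {
	TDI_CONFIG_UNLOCKED,	/* shared mode, device being prepared */
	TDI_CONFIG_LOCKED,	/* configuration frozen, evidence available */
	TDI_RUN,		/* private mode, driver operating the device */
};

struct tdi_model {
	enum tdi_state_sketch state;
	bool bme;	/* Bus Master Enable */
	bool mse;	/* Memory Space Enable */
	bool accepted;	/* userspace is satisfied with the device evidence */
};

/* Userspace-triggered: enable BME/MSE without any driver, then lock. */
static int tdi_lock_sketch(struct tdi_model *tdi)
{
	if (tdi->state != TDI_CONFIG_UNLOCKED)
		return -EINVAL;
	tdi->bme = true;
	tdi->mse = true;
	tdi->state = TDI_CONFIG_LOCKED;
	return 0;
}

/* Userspace-triggered: record the outcome of verifying the evidence. */
static int tdi_accept_sketch(struct tdi_model *tdi)
{
	if (tdi->state != TDI_CONFIG_LOCKED)
		return -EINVAL;
	tdi->accepted = true;
	return 0;
}

/* Driver-triggered: an enlightened driver moves LOCKED -> RUN when ready. */
static int tdi_run_sketch(struct tdi_model *tdi)
{
	if (tdi->state != TDI_CONFIG_LOCKED)
		return -EINVAL;
	if (!tdi->accepted)
		return -EPERM;
	tdi->state = TDI_RUN;
	return 0;
}
```

The split keeps policy (is this device trustworthy?) in userspace while
the driver only decides when it is ready to start operating the device
in private mode.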

> > I hesitate to keep commenting because this is so far off of the lifetime
> > and code organization expectations I thought we were negotiating with
> > the PCI/TSM series. So I will stop here for now.
> 
> Good call, sorry for the mess. Thanks for the review!

No harm done. The code is useful and the disconnect on the communication
/ documentation is now understood.

> ps: I'll just fix the things I did not comment on but I'm not ignoring them.

Sounds good.