RE: [RFC PATCH] vfio/pci: add PCIe TPH to device feature ioctl

Wathsala Wathawana Vithanage <wathsala.vithanage@xxxxxxx> · Mon, 3 Mar 2025 04:20:24 +0000

Hi Alex,

> > Linux v6.13 introduced the PCIe TLP Processing Hints (TPH) feature for
> > direct cache injection. As described in the relevant patch set [1],
> > direct cache injection in supported hardware allows optimal platform
> > resource utilization for specific requests on the PCIe bus. This
> > feature is currently available only for kernel device drivers.
> > However, user space applications, especially those whose performance
> > is sensitive to the latency of inbound writes as seen by a CPU core,
> > may benefit from using this information (E.g., DPDK cache stashing RFC
> > [2] or an HPC application running in a VM).
> >
> > This patch enables configuring of TPH from the user space via
> > VFIO_DEVICE_FEATURE IOCLT. It provides an interface to user space
> > drivers and VMMs to enable/disable the TPH feature on PCIe devices and
> > set steering tags in MSI-X or steering-tag table entries using
> > VFIO_DEVICE_FEATURE_SET flag or read steering tags from the kernel
> > using VFIO_DEVICE_FEATURE_GET to operate in device-specific mode.
> 
> Unless I'm missing it, the RFC in [2] doesn't make use of this proposal.  Is there
> published code anywhere that does use this interface?
> 
> > [1]
> > lore.kernel.org/linux-pci/20241002165954.128085-1-wei.huang2@xxxxxxx
> > [2]
> > inbox.dpdk.org/dev/20241021015246.304431-2-wathsala.vithanage@xxxxxxx
> >

The DPDK RFC in question is on hold for now because there is no way to get steering-tags
from the user space. 
Consequently, I had to stop working on [2] and start working on the kernel to get VFIO ready
for it. This was discussed in a DPDK community call sometime back.
https://mails.dpdk.org/archives/dev/2024-October/305408.html

> > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@xxxxxxx>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 163 +++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h        |  68 +++++++++++++
> >  2 files changed, 231 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c
> > b/drivers/vfio/pci/vfio_pci_core.c
> > index 586e49efb81b..d6dd0495b08b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/nospec.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/iommufd.h>
> > +#include <linux/pci-tph.h>
> >  #if IS_ENABLED(CONFIG_EEH)
> >  #include <asm/eeh.h>
> >  #endif
> > @@ -1510,6 +1511,165 @@ static int vfio_pci_core_feature_token(struct
> vfio_device *device, u32 flags,
> >  	return 0;
> >  }
> >
> > +static ssize_t vfio_pci_tph_uinfo_dup(struct vfio_pci_tph *tph,
> > +				      void __user *arg, size_t argsz,
> > +				      struct vfio_pci_tph_info **info) {
> > +	size_t minsz;
> > +
> > +	if (tph->count > VFIO_TPH_INFO_MAX)
> > +		return -EINVAL;
> > +	if (!tph->count)
> > +		return 0;
> > +
> > +	minsz = tph->count * sizeof(struct vfio_pci_tph_info);
> > +	if (minsz < argsz)
> > +		return -EINVAL;
> > +
> > +	*info = memdup_user(arg, minsz);
> > +	if (IS_ERR(info))
> > +		return PTR_ERR(info);
> > +
> > +	return minsz;
> > +}
> > +
> > +static int vfio_pci_feature_tph_st_op(struct vfio_pci_core_device *vdev,
> > +				      struct vfio_pci_tph *tph,
> > +				      void __user *arg, size_t argsz) {
> > +	int i, mtype, err = 0;
> > +	u32 cpu_uid;
> > +	struct vfio_pci_tph_info *info = NULL;
> > +	ssize_t data_size = vfio_pci_tph_uinfo_dup(tph, arg, argsz, &info);
> > +
> > +	if (data_size <= 0)
> > +		return data_size;
> > +
> > +	for (i = 0; i < tph->count; i++) {
> > +		if (!(info[i].cpu_id < nr_cpu_ids && cpu_present(info[i].cpu_id)))
> > +{
> 
> Not intuitive logic, imo.  This could easily be:
> 
> 		if (info[i].cpu_id >= nr_cpu_ids || !cpu_present(info[i].cpu_id))
> 

Agree, looks more intuitive.

> > +			info[i].err = -EINVAL;
> > +			continue;
> > +		}
> > +		cpu_uid = topology_core_id(info[i].cpu_id);
> > +		mtype = (info[i].flags & VFIO_TPH_MEM_TYPE_MASK) >>
> > +			VFIO_TPH_MEM_TYPE_SHIFT;
> > +
> > +		/* processing hints are always ignored */
> > +		info[i].ph_ignore = 1;
> > +
> > +		info[i].err = pcie_tph_get_cpu_st(vdev->pdev, mtype, cpu_uid,
> > +						  &info[i].st);
> > +		if (info[i].err)
> > +			continue;
> > +
> > +		if (tph->flags & VFIO_DEVICE_FEATURE_TPH_SET_ST) {
> > +			info[i].err = pcie_tph_set_st_entry(vdev->pdev,
> > +							    info[i].index,
> > +							    info[i].st);
> > +		}
> > +	}
> > +
> > +	if (copy_to_user(arg, info, data_size))
> > +		err = -EFAULT;
> > +
> > +	kfree(info);
> > +	return err;
> > +}
> > +
> > +
> > +static int vfio_pci_feature_tph_enable(struct vfio_pci_core_device *vdev,
> > +				       struct vfio_pci_tph *arg)
> > +{
> > +	int mode = arg->flags & VFIO_TPH_ST_MODE_MASK;
> > +
> > +	switch (mode) {
> > +	case VFIO_TPH_ST_NS_MODE:
> > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_NS_MODE);
> > +
> > +	case VFIO_TPH_ST_IV_MODE:
> > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_IV_MODE);
> > +
> > +	case VFIO_TPH_ST_DS_MODE:
> > +		return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_DS_MODE);
> > +
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +}
> > +
> > +static int vfio_pci_feature_tph_disable(struct vfio_pci_core_device
> > +*vdev) {
> > +	pcie_disable_tph(vdev->pdev);
> > +	return 0;
> > +}
> > +
> > +static int vfio_pci_feature_tph_prepare(struct vfio_pci_tph __user *arg,
> > +					size_t argsz, u32 flags,
> > +					struct vfio_pci_tph *tph)
> > +{
> > +	u32 op;
> > +	int err = vfio_check_feature(flags, argsz,
> > +				 VFIO_DEVICE_FEATURE_SET |
> > +				 VFIO_DEVICE_FEATURE_GET,
> > +				 sizeof(struct vfio_pci_tph));
> > +	if (err != 1)
> > +		return err;
> 
> We don't seem to account for a host booted with pci=notph.
> 

pcie_enable_tph() looks for pci=notph and fails if it's set.
If pcie_enable_tph() fails, pcie_tph_get_cpu_st() and 
pcie_tph_set_st_entry() fail too.

Is it required to check pci=notph here as well?

> > +
> > +	if (copy_from_user(tph, arg, sizeof(struct vfio_pci_tph)))
> > +		return -EFAULT;
> > +
> > +	op = tph->flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK;
> > +
> > +	switch (op) {
> > +	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
> > +	case VFIO_DEVICE_FEATURE_TPH_DISABLE:
> > +	case VFIO_DEVICE_FEATURE_TPH_SET_ST:
> > +		return (flags & VFIO_DEVICE_FEATURE_SET) ? 0 : -EINVAL;
> > +
> > +	case VFIO_DEVICE_FEATURE_TPH_GET_ST:
> > +		return (flags & VFIO_DEVICE_FEATURE_GET) ? 0 : -EINVAL;
> 
> This is a convoluted mangling of an ioctl into a vfio feature.
> 

Do you prefer all four operations to fold under a single ioctl? Or split them
such that enabling and disabling TPH remains a VFIO SET feature while
SET_ST and GET_ST moved to a regular ioctl?

> > +
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +}
> > +
> > +static int vfio_pci_core_feature_tph(struct vfio_device *device, u32 flags,
> > +				     struct vfio_pci_tph __user *arg,
> > +				     size_t argsz)
> > +{
> > +	u32 op;
> > +	struct vfio_pci_tph tph;
> > +	void __user *uinfo;
> > +	size_t infosz;
> > +	struct vfio_pci_core_device *vdev =
> > +		container_of(device, struct vfio_pci_core_device, vdev);
> > +	int err = vfio_pci_feature_tph_prepare(arg, argsz, flags, &tph);
> > +
> > +	if (err)
> > +		return err;
> > +
> > +	op = tph.flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK;
> > +
> > +	switch (op) {
> > +	case VFIO_DEVICE_FEATURE_TPH_ENABLE:
> > +		return vfio_pci_feature_tph_enable(vdev, &tph);
> > +
> > +	case VFIO_DEVICE_FEATURE_TPH_DISABLE:
> > +		return vfio_pci_feature_tph_disable(vdev);
> > +
> > +	case VFIO_DEVICE_FEATURE_TPH_GET_ST:
> > +	case VFIO_DEVICE_FEATURE_TPH_SET_ST:
> > +		uinfo = (u8 *)(arg) + offsetof(struct vfio_pci_tph, info);
> > +		infosz = argsz - sizeof(struct vfio_pci_tph);
> > +		return vfio_pci_feature_tph_st_op(vdev, &tph, uinfo, infosz);
> 
> This is effectively encoding a regular ioctl as a feature.  See below.
> 

Addressed this in the above comment.

> > +
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +}
> > +
> >  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> >  				void __user *arg, size_t argsz)
> >  {
> > @@ -1523,6 +1683,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device
> *device, u32 flags,
> >  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
> >  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> >  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> > +	case VFIO_DEVICE_FEATURE_PCI_TPH:
> > +		return vfio_pci_core_feature_tph(device, flags,
> > +						 arg, argsz);
> >  	default:
> >  		return -ENOTTY;
> >  	}
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index c8dbf8219c4f..608d57dfe279 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1458,6 +1458,74 @@ struct vfio_device_feature_bus_master {  };
> > #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
> >
> > +/*
> > + * Upon VFIO_DEVICE_FEATURE_SET, enable or disable PCIe TPH or set
> > +steering tags
> > + * on the device. Data provided when setting this feature is a __u32
> > +with the
> > + * following flags. VFIO_DEVICE_FEATURE_TPH_ENABLE enables PCIe TPH
> > +in
> > + * no-steering-tag, interrupt-vector, or device-specific mode when
> > +feature flags
> > + * VFIO_TPH_ST_NS_MODE, VFIO_TPH_ST_IV_MODE, and
> VFIO_TPH_ST_DS_MODE
> > +are set
> > + * respectively.
> > + * VFIO_DEVICE_FEATURE_TPH_DISABLE disables PCIe TPH on the device.
> > + * VFIO_DEVICE_FEATURE_TPH_SET_ST set steering tags on a device at an
> > +index in
> > + * MSI-X or ST-table depending on the VFIO_TPH_ST_x_MODE flag used
> > +and device
> > + * capabilities. The caller can set multiple steering tags by passing
> > +an array
> > + * of vfio_pci_tph_info objects containing cpu_id, cache_level, and
> > + * MSI-X/ST-table index. The caller can also set the intended memory
> > +type and
> > + * the processing hint by setting VFIO_TPH_MEM_TYPE_x and
> > +VFIO_TPH_HINT_x flags,
> > + * respectively. The return value for each vfio_pci_tph_info object
> > +is stored in
> > + * err, with the steering-tag set on the device and the ph_ignore
> > +status bit
> > + * resulting from the steering-tag lookup operation. If err < 0, the
> > +values
> > + * stored in the st and ph_ignore fields should be considered invalid.
> > + *
> 
> Sorry, I don't understand ph_ignore as described here.  It's only ever set to 1.  I
> guess we assume the incoming state is zero.  But what does it mean?
> 

Agree, it's confusing and it has something to do with the current implementation
of TPH in the Kernel. 
PCIe firmware specification states root-port ACPI _DSM as the single source of
truth for steering-tags. Its job is to maintain a table of steering-tags indexed by
a tuple of CPU/Container UID, Cache-id and a processing-hint>. Each tuple is mapped
to two steering-tags for persistent or volatile memory (either or both depending on
the platform).
A caller looking to acquire a steering tag for a cache closer to a CPU will pass above
tuple in the format defined in the PCIe firmware specification. 
But PCIe spec also allows implementing TPH without processing-hints, in such cases
the _DSM is supposed to set ph_ignore bit in the return structure.
Current TPH implementation in the Linux kernel ignores some of these details,
it sets cache-id and the processing-hint in the above tuple to zeros. It also ignores the
ph_ignore bit return by the _DSM as well. 
However, I wrote this RFC to match the PCI firmware specification, therefore it sets
ph_ignore bit to 1 to inform the caller that it is ignored so that caller can make an
informed decision to stick with the default (bidirectional hint) or exit with error.
This is why the cache_level is ignored as well, ideally cache levels (0 for L1D, 1 for L2D,
Etc.) should be converted to a PPTT cache_id and passed into the _DSM.
I'm working on a separate PCI patch to fix above issues.

> > + * Upon VFIO_DEVICE_FEATURE_GET,  return steering tags to the caller.
> > + * VFIO_DEVICE_FEATURE_TPH_GET_ST returns steering tags to the caller.
> > + * The return values per vfio_pci_tph_info object are stored in the
> > + st,
> > + * ph_ignore, and err fields.
> 
> Why do we need to provide an interface to return the steering tags set by the
> user?  Seems like this could be a write-only, SET, interface.
> 

VFIO_DEVICE_FEATURE_TPH_GET_ST calls the _DSM to read the steering-tag
for a vector of tuples of a cpu-id, cache-id, processing-hint. It is for devices that operate
in device specific mode where they don't set steering-tags in the MSI-X or ST tables but
in another construct like a queue context accessible from the user/kernel space. 
For instance, this is what will be used by DPDK poll mode drivers as [2] fleshes
out. 

So, VFIO_DEVICE_FEATURE_TPH_GET_ST doesn't return steering-tags set by the
user. It's there to read the platform database of steering-tags which is an ACPI _DSM
bound each PCIe root port.

> > + */
> > +struct vfio_pci_tph_info {
> 
> This seems more of an _entry than an _info.  Note that vfio has various INFO
> ioctls which make this nomenclature confusing.
> 

It is, it's more of a struct that describes the request.
Perhaps vfio_pci_tph_request_descriptor, but that's too long. 
Suggestions are welcome.

> > +	/* in */
> > +	__u32 cpu_id;
> 
> The logical ID?
> 
Yes, this is logical ID.

> > +	__u32 cache_level;
> 
> Zero based?  1-based?  How many cache levels are there?  What's valid here?
> 
It's Zero based. This is currently ignored due to limitations in the kernel TPH. 
One way to validate would be iterating through the topology to find it, decrementing
a copy_of_cache_level as we move further away from the cpu_id until the value
reaches zero. If loop terminates before copy_of_cache_level reaching zero, -EINVAL.
Is that a good approach?

> > +	__u8  flags;
> > +#define VFIO_TPH_MEM_TYPE_MASK		0x1
> > +#define VFIO_TPH_MEM_TYPE_SHIFT		0
> > +#define VFIO_TPH_MEM_TYPE_VMEM		0	/* Request
> volatile memory ST */
> > +#define VFIO_TPH_MEM_TYPE_PMEM		1	/* Request
> persistent memory ST */
> 
> Is there a relation to the cache_level here?  Spec references here and below,
> assuming those are relevant to these values.
> 
There is no relation to cache level. Return from the _DSM has PMEM and VMEM
sections. If either of those are not supported a valid bit is set to 0 in the return.

PCIe spec has four processing-hints, VFIO_TPH_HINT_* are there to set them
in the flags field. Hints are forced to VFIO_TPH_HINT_BIDIR by current TPH
implementation as I described above. 

> > +
> > +#define VFIO_TPH_HINT_MASK		0x3
> > +#define VFIO_TPH_HINT_SHIFT		1
> > +#define VFIO_TPH_HINT_BIDIR		0
> > +#define VFIO_TPH_HINT_REQSTR		(1 << VFIO_TPH_HINT_SHIFT)
> > +#define VFIO_TPH_HINT_TARGET		(2 << VFIO_TPH_HINT_SHIFT)
> > +#define VFIO_TPH_HINT_TARGET_PRIO	(3 << VFIO_TPH_HINT_SHIFT)
> 
> There needs to be a __u8 padding in here somewhere or flags extended to
> __u16.
> 
> > +	__u16 index;			/* MSI-X/ST-table index to set ST */
> > +	/* out */
> > +	__u16 st;			/* Steering-Tag */
> 
> Sorry if I'm just not familiar with TPH, but why do we need to return the ST?
> Doesn't hardware make use of the ST index and the physical value gets applied
> automatically in HW?
> 

Device specific mode requires this. For instance, intel E810 NIC can set
Steering-tags in queue context which is in user-space when used with DPDK.
For such use cases we must return steering-tags read from the _DSM. 
Going back to DPDK RFC mentioned here, if TPH gets enabled in VFIO, the E810 
poll mode driver in the DPDK will ask the kernel a steering-tag for a combination
of a cpu-id, a cache-level and a processing-hint. In response the kernel will
invoke the ACPI _DSM for the root port of the device via pcie_tph_get_cpu_st()
and return the steering-tag back to the user. E810 driver will set the returned tag
in appropriate queue context.

For cases where steering-tag is not required in user-space (VMMs), caller asks the
kernel to set steering-tag that corresponds the cpu-id, cache-level, and PH
combination at a given MSI-X vector entry or ST table in config space. For such
cases too, the kernel will read the steering-tag from the _DSM and set the tag
in the corresponding MSI-X or ST table entry.

> > +	__u8  ph_ignore;		/* Processing hint was ignored by */
> 
> Padding/alignment, same as above.  "ignored by..."  By what?  Is that an error?
> 

An error. "Processing hint was ignored by the platform"

> > +	__s32 err;			/* Error on getting/setting Steering-
> Tag*/
> > +};
> 
> Generally we'd expect a system call either works or fails.  Why do we need per
> entry error report?  Can't we validate and prepare the entire operation before
> setting any of it into the device?  Ultimately we're just writing hints to config
> space or MSI-X table space, so the write operation itself is not likely to be the
> point of failure.
> 

Write to MSI-X won't fail but reading the steering-tag from the _DSM for an
invalid <cpu_id, cach_id, ph_hint> combo can fail in both GET_ST and SET_ST
cases.
We can change this to an all or nothing interface, such that success if all the
entries are valid and otherwise if at least one is invalid. But in that case the
caller may not know which entries were invalid. If you think that's ok, I can
change it.

> > +
> > +struct vfio_pci_tph {
> > +	__u32 argsz;			/* Size of vfio_pci_tph and info[] */
> > +	__u32 flags;
> > +#define VFIO_DEVICE_FEATURE_TPH_OP_MASK		0x7
> > +#define VFIO_DEVICE_FEATURE_TPH_OP_SHIFT	3
> > +#define VFIO_DEVICE_FEATURE_TPH_ENABLE		0	/*
> Enable TPH on device */
> > +#define VFIO_DEVICE_FEATURE_TPH_DISABLE	1	/* Disable TPH
> on device */
> > +#define VFIO_DEVICE_FEATURE_TPH_GET_ST		2	/* Get
> steering-tags */
> > +#define VFIO_DEVICE_FEATURE_TPH_SET_ST		4	/* Set
> steering-rags */
> 
> s/rags/tags/
> 
> vfio device features already have GET and SET as part of their base structure, why
> are they duplicated here?  It really seems like there are two separate features
> here, one that allows enabling with a given mode or disable, and another that
> allows writing specific steering tags.  Both seem like they could be SET-only
> features.  It's also not clear to me that there's a significant advantage to providing
> an array of steering tags.  Isn't updating STs an infrequent operation and likely
> bound to at most 2K tags in the case of MSI-X?
> 

VFIO_DEVICE_FEATURE_TPH_ENABLE and VFIO_DEVICE_FEATURE_TPH_DISABLE
are SET features.
Since, VFIO_DEVICE_FEATURE_TPH_SET_ST requires writing to ST table, so that too
falls under SET features.
Therefore, these flags are only valid when VFIO_DEVICE_FEATURE_SET flag is set.
The only GET features use case here is VFIO_DEVICE_FEATURE_TPH_GET_ST that
reads the steering-tags from the _DSM and returns it back to caller. Only valid
with VFIO_DEVICE_FEATURE_GET flag. 
These are checked in vfio_pci_feature_tph_prepare().

As I mentioned earlier, does it make sense to leave enable/disable under VFIO
Feature and move GET_ST and SET_ST to a separate ioctl?

There isn't much rational to providing an array of tuples other than following
the format VFIO_DEVICE_SET_IRQS that sets eventfds.

> > +
> > +#define	VFIO_TPH_ST_MODE_MASK	(0x3 <<
> VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > +#define	VFIO_TPH_ST_NS_MODE	(0 <<
> VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > +#define	VFIO_TPH_ST_IV_MODE	(1 <<
> VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > +#define	VFIO_TPH_ST_DS_MODE	(2 <<
> VFIO_DEVICE_FEATURE_TPH_OP_SHIFT)
> > +	__u32 count;				/* Number of entries in info[]
> */
> > +	struct vfio_pci_tph_info info[];
> > +#define VFIO_TPH_INFO_MAX	64		/* Max entries allowed
> in info[] */
> 
> This seems to match the limit if the table is located in extended config space, but
> it's not particularly relevant if the table is located in MSI-X space.  Why is this a
> good limit?

Yes, for MSI-X the caller may have to invoke the ioctl multiple times if more
than 64 entries need to be updated. 

> 
> Also, try to keep these all in 80 column.  Thanks,
> 

I will, try to keep these under 80 columns.

Thanks Alex, for your input. Let me know what you think about moving SET_ST
and GET_ST to an ioctl and keep ENABLE/DISABLE TPH as a vfio feature.

--Wathsala