On Mon, 3 Mar 2025 04:20:24 +0000 Wathsala Wathawana Vithanage <wathsala.vithanage@xxxxxxx> wrote: > Hi Alex, > > > > Linux v6.13 introduced the PCIe TLP Processing Hints (TPH) feature for > > > direct cache injection. As described in the relevant patch set [1], > > > direct cache injection in supported hardware allows optimal platform > > > resource utilization for specific requests on the PCIe bus. This > > > feature is currently available only for kernel device drivers. > > > However, user space applications, especially those whose performance > > > is sensitive to the latency of inbound writes as seen by a CPU core, > > > may benefit from using this information (E.g., DPDK cache stashing RFC > > > [2] or an HPC application running in a VM). > > > > > > This patch enables configuring of TPH from the user space via > > > VFIO_DEVICE_FEATURE IOCLT. It provides an interface to user space > > > drivers and VMMs to enable/disable the TPH feature on PCIe devices and > > > set steering tags in MSI-X or steering-tag table entries using > > > VFIO_DEVICE_FEATURE_SET flag or read steering tags from the kernel > > > using VFIO_DEVICE_FEATURE_GET to operate in device-specific mode. > > > > Unless I'm missing it, the RFC in [2] doesn't make use of this proposal. Is there > > published code anywhere that does use this interface? > > > > > [1] > > > lore.kernel.org/linux-pci/20241002165954.128085-1-wei.huang2@xxxxxxx > > > [2] > > > inbox.dpdk.org/dev/20241021015246.304431-2-wathsala.vithanage@xxxxxxx > > > > > The DPDK RFC in question is on hold for now because there is no way to get steering-tags > from the user space. > Consequently, I had to stop working on [2] and start working on the kernel to get VFIO ready > for it. This was discussed in a DPDK community call sometime back. > https://mails.dpdk.org/archives/dev/2024-October/305408.html I think the userspace and kernel support need to be co-developed, we don't want to be adding kernel code and uAPIs that don't have users. > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage@xxxxxxx> > > > --- > > > drivers/vfio/pci/vfio_pci_core.c | 163 +++++++++++++++++++++++++++++++ > > > include/uapi/linux/vfio.h | 68 +++++++++++++ > > > 2 files changed, 231 insertions(+) > > > > > > diff --git a/drivers/vfio/pci/vfio_pci_core.c > > > b/drivers/vfio/pci/vfio_pci_core.c > > > index 586e49efb81b..d6dd0495b08b 100644 > > > --- a/drivers/vfio/pci/vfio_pci_core.c > > > +++ b/drivers/vfio/pci/vfio_pci_core.c > > > @@ -29,6 +29,7 @@ > > > #include <linux/nospec.h> > > > #include <linux/sched/mm.h> > > > #include <linux/iommufd.h> > > > +#include <linux/pci-tph.h> > > > #if IS_ENABLED(CONFIG_EEH) > > > #include <asm/eeh.h> > > > #endif > > > @@ -1510,6 +1511,165 @@ static int vfio_pci_core_feature_token(struct > > vfio_device *device, u32 flags, > > > return 0; > > > } > > > > > > +static ssize_t vfio_pci_tph_uinfo_dup(struct vfio_pci_tph *tph, > > > + void __user *arg, size_t argsz, > > > + struct vfio_pci_tph_info **info) { > > > + size_t minsz; > > > + > > > + if (tph->count > VFIO_TPH_INFO_MAX) > > > + return -EINVAL; > > > + if (!tph->count) > > > + return 0; > > > + > > > + minsz = tph->count * sizeof(struct vfio_pci_tph_info); > > > + if (minsz < argsz) > > > + return -EINVAL; > > > + > > > + *info = memdup_user(arg, minsz); > > > + if (IS_ERR(info)) > > > + return PTR_ERR(info); > > > + > > > + return minsz; > > > +} > > > + > > > +static int vfio_pci_feature_tph_st_op(struct vfio_pci_core_device *vdev, > > > + struct vfio_pci_tph *tph, > > > + void __user *arg, size_t argsz) { > > > + int i, mtype, err = 0; > > > + u32 cpu_uid; > > > + struct vfio_pci_tph_info *info = NULL; > > > + ssize_t data_size = vfio_pci_tph_uinfo_dup(tph, arg, argsz, &info); > > > + > > > + if (data_size <= 0) > > > + return data_size; > > > + > > > + for (i = 0; i < tph->count; i++) { > > > + if (!(info[i].cpu_id < nr_cpu_ids && cpu_present(info[i].cpu_id))) > > > +{ > > > > Not intuitive logic, imo. This could easily be: > > > > if (info[i].cpu_id >= nr_cpu_ids || !cpu_present(info[i].cpu_id)) > > > > Agree, looks more intuitive. > > > > + info[i].err = -EINVAL; > > > + continue; > > > + } > > > + cpu_uid = topology_core_id(info[i].cpu_id); > > > + mtype = (info[i].flags & VFIO_TPH_MEM_TYPE_MASK) >> > > > + VFIO_TPH_MEM_TYPE_SHIFT; > > > + > > > + /* processing hints are always ignored */ > > > + info[i].ph_ignore = 1; > > > + > > > + info[i].err = pcie_tph_get_cpu_st(vdev->pdev, mtype, cpu_uid, > > > + &info[i].st); > > > + if (info[i].err) > > > + continue; > > > + > > > + if (tph->flags & VFIO_DEVICE_FEATURE_TPH_SET_ST) { > > > + info[i].err = pcie_tph_set_st_entry(vdev->pdev, > > > + info[i].index, > > > + info[i].st); > > > + } > > > + } > > > + > > > + if (copy_to_user(arg, info, data_size)) > > > + err = -EFAULT; > > > + > > > + kfree(info); > > > + return err; > > > +} > > > + > > > + > > > +static int vfio_pci_feature_tph_enable(struct vfio_pci_core_device *vdev, > > > + struct vfio_pci_tph *arg) > > > +{ > > > + int mode = arg->flags & VFIO_TPH_ST_MODE_MASK; > > > + > > > + switch (mode) { > > > + case VFIO_TPH_ST_NS_MODE: > > > + return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_NS_MODE); > > > + > > > + case VFIO_TPH_ST_IV_MODE: > > > + return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_IV_MODE); > > > + > > > + case VFIO_TPH_ST_DS_MODE: > > > + return pcie_enable_tph(vdev->pdev, PCI_TPH_ST_DS_MODE); > > > + > > > + default: > > > + return -EINVAL; > > > + } > > > + > > > +} > > > + > > > +static int vfio_pci_feature_tph_disable(struct vfio_pci_core_device > > > +*vdev) { > > > + pcie_disable_tph(vdev->pdev); > > > + return 0; > > > +} > > > + > > > +static int vfio_pci_feature_tph_prepare(struct vfio_pci_tph __user *arg, > > > + size_t argsz, u32 flags, > > > + struct vfio_pci_tph *tph) > > > +{ > > > + u32 op; > > > + int err = vfio_check_feature(flags, argsz, > > > + VFIO_DEVICE_FEATURE_SET | > > > + VFIO_DEVICE_FEATURE_GET, > > > + sizeof(struct vfio_pci_tph)); > > > + if (err != 1) > > > + return err; > > > > We don't seem to account for a host booted with pci=notph. > > > > pcie_enable_tph() looks for pci=notph and fails if it's set. > If pcie_enable_tph() fails, pcie_tph_get_cpu_st() and > pcie_tph_set_st_entry() fail too. > > Is it required to check pci=notph here as well? Does it make sense to probe affirmatively for a feature that can't be enabled? I think I previously also overlooked that we never actually check that the device supports TPH at probe time. > > > + > > > + if (copy_from_user(tph, arg, sizeof(struct vfio_pci_tph))) > > > + return -EFAULT; > > > + > > > + op = tph->flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK; > > > + > > > + switch (op) { > > > + case VFIO_DEVICE_FEATURE_TPH_ENABLE: > > > + case VFIO_DEVICE_FEATURE_TPH_DISABLE: > > > + case VFIO_DEVICE_FEATURE_TPH_SET_ST: > > > + return (flags & VFIO_DEVICE_FEATURE_SET) ? 0 : -EINVAL; > > > + > > > + case VFIO_DEVICE_FEATURE_TPH_GET_ST: > > > + return (flags & VFIO_DEVICE_FEATURE_GET) ? 0 : -EINVAL; > > > > This is a convoluted mangling of an ioctl into a vfio feature. > > > > Do you prefer all four operations to fold under a single ioctl? Or split them > such that enabling and disabling TPH remains a VFIO SET feature while > SET_ST and GET_ST moved to a regular ioctl? Splitting the functionality between a feature and a new ioctl doesn't make any sense. As I noted, I can imagine this interface being two related features where one allows the TPH state to be set enabled or disabled and the other allows setting steering tags. > > > + > > > + default: > > > + return -EINVAL; > > > + } > > > +} > > > + > > > +static int vfio_pci_core_feature_tph(struct vfio_device *device, u32 flags, > > > + struct vfio_pci_tph __user *arg, > > > + size_t argsz) > > > +{ > > > + u32 op; > > > + struct vfio_pci_tph tph; > > > + void __user *uinfo; > > > + size_t infosz; > > > + struct vfio_pci_core_device *vdev = > > > + container_of(device, struct vfio_pci_core_device, vdev); > > > + int err = vfio_pci_feature_tph_prepare(arg, argsz, flags, &tph); > > > + > > > + if (err) > > > + return err; > > > + > > > + op = tph.flags & VFIO_DEVICE_FEATURE_TPH_OP_MASK; > > > + > > > + switch (op) { > > > + case VFIO_DEVICE_FEATURE_TPH_ENABLE: > > > + return vfio_pci_feature_tph_enable(vdev, &tph); > > > + > > > + case VFIO_DEVICE_FEATURE_TPH_DISABLE: > > > + return vfio_pci_feature_tph_disable(vdev); > > > + > > > + case VFIO_DEVICE_FEATURE_TPH_GET_ST: > > > + case VFIO_DEVICE_FEATURE_TPH_SET_ST: > > > + uinfo = (u8 *)(arg) + offsetof(struct vfio_pci_tph, info); > > > + infosz = argsz - sizeof(struct vfio_pci_tph); > > > + return vfio_pci_feature_tph_st_op(vdev, &tph, uinfo, infosz); > > > > This is effectively encoding a regular ioctl as a feature. See below. > > > > Addressed this in the above comment. > > > > + > > > + default: > > > + return -EINVAL; > > > + } > > > +} > > > + > > > int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, > > > void __user *arg, size_t argsz) > > > { > > > @@ -1523,6 +1683,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device > > *device, u32 flags, > > > return vfio_pci_core_pm_exit(device, flags, arg, argsz); > > > case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: > > > return vfio_pci_core_feature_token(device, flags, arg, argsz); > > > + case VFIO_DEVICE_FEATURE_PCI_TPH: > > > + return vfio_pci_core_feature_tph(device, flags, > > > + arg, argsz); > > > default: > > > return -ENOTTY; > > > } > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > > index c8dbf8219c4f..608d57dfe279 100644 > > > --- a/include/uapi/linux/vfio.h > > > +++ b/include/uapi/linux/vfio.h > > > @@ -1458,6 +1458,74 @@ struct vfio_device_feature_bus_master { }; > > > #define VFIO_DEVICE_FEATURE_BUS_MASTER 10 > > > > > > +/* > > > + * Upon VFIO_DEVICE_FEATURE_SET, enable or disable PCIe TPH or set > > > +steering tags > > > + * on the device. Data provided when setting this feature is a __u32 > > > +with the > > > + * following flags. VFIO_DEVICE_FEATURE_TPH_ENABLE enables PCIe TPH > > > +in > > > + * no-steering-tag, interrupt-vector, or device-specific mode when > > > +feature flags > > > + * VFIO_TPH_ST_NS_MODE, VFIO_TPH_ST_IV_MODE, and > > VFIO_TPH_ST_DS_MODE > > > +are set > > > + * respectively. > > > + * VFIO_DEVICE_FEATURE_TPH_DISABLE disables PCIe TPH on the device. > > > + * VFIO_DEVICE_FEATURE_TPH_SET_ST set steering tags on a device at an > > > +index in > > > + * MSI-X or ST-table depending on the VFIO_TPH_ST_x_MODE flag used > > > +and device > > > + * capabilities. The caller can set multiple steering tags by passing > > > +an array > > > + * of vfio_pci_tph_info objects containing cpu_id, cache_level, and > > > + * MSI-X/ST-table index. The caller can also set the intended memory > > > +type and > > > + * the processing hint by setting VFIO_TPH_MEM_TYPE_x and > > > +VFIO_TPH_HINT_x flags, > > > + * respectively. The return value for each vfio_pci_tph_info object > > > +is stored in > > > + * err, with the steering-tag set on the device and the ph_ignore > > > +status bit > > > + * resulting from the steering-tag lookup operation. If err < 0, the > > > +values > > > + * stored in the st and ph_ignore fields should be considered invalid. > > > + * > > > > Sorry, I don't understand ph_ignore as described here. It's only ever set to 1. I > > guess we assume the incoming state is zero. But what does it mean? > > > > Agree, it's confusing and it has something to do with the current implementation > of TPH in the Kernel. > PCIe firmware specification states root-port ACPI _DSM as the single source of > truth for steering-tags. Its job is to maintain a table of steering-tags indexed by > a tuple of CPU/Container UID, Cache-id and a processing-hint>. Each tuple is mapped > to two steering-tags for persistent or volatile memory (either or both depending on > the platform). > A caller looking to acquire a steering tag for a cache closer to a CPU will pass above > tuple in the format defined in the PCIe firmware specification. > But PCIe spec also allows implementing TPH without processing-hints, in such cases > the _DSM is supposed to set ph_ignore bit in the return structure. > Current TPH implementation in the Linux kernel ignores some of these details, > it sets cache-id and the processing-hint in the above tuple to zeros. It also ignores the > ph_ignore bit return by the _DSM as well. > However, I wrote this RFC to match the PCI firmware specification, therefore it sets > ph_ignore bit to 1 to inform the caller that it is ignored so that caller can make an > informed decision to stick with the default (bidirectional hint) or exit with error. > This is why the cache_level is ignored as well, ideally cache levels (0 for L1D, 1 for L2D, > Etc.) should be converted to a PPTT cache_id and passed into the _DSM. > I'm working on a separate PCI patch to fix above issues. > > > > + * Upon VFIO_DEVICE_FEATURE_GET, return steering tags to the caller. > > > + * VFIO_DEVICE_FEATURE_TPH_GET_ST returns steering tags to the caller. > > > + * The return values per vfio_pci_tph_info object are stored in the > > > + st, > > > + * ph_ignore, and err fields. > > > > Why do we need to provide an interface to return the steering tags set by the > > user? Seems like this could be a write-only, SET, interface. > > > > VFIO_DEVICE_FEATURE_TPH_GET_ST calls the _DSM to read the steering-tag > for a vector of tuples of a cpu-id, cache-id, processing-hint. It is for devices that operate > in device specific mode where they don't set steering-tags in the MSI-X or ST tables but > in another construct like a queue context accessible from the user/kernel space. > For instance, this is what will be used by DPDK poll mode drivers as [2] fleshes > out. > > So, VFIO_DEVICE_FEATURE_TPH_GET_ST doesn't return steering-tags set by the > user. It's there to read the platform database of steering-tags which is an ACPI _DSM > bound each PCIe root port. > > > > + */ > > > +struct vfio_pci_tph_info { > > > > This seems more of an _entry than an _info. Note that vfio has various INFO > > ioctls which make this nomenclature confusing. > > > > It is, it's more of a struct that describes the request. > Perhaps vfio_pci_tph_request_descriptor, but that's too long. > Suggestions are welcome. > > > > > + /* in */ > > > + __u32 cpu_id; > > > > The logical ID? > > > Yes, this is logical ID. > > > > + __u32 cache_level; > > > > Zero based? 1-based? How many cache levels are there? What's valid here? > > > It's Zero based. This is currently ignored due to limitations in the kernel TPH. > One way to validate would be iterating through the topology to find it, decrementing > a copy_of_cache_level as we move further away from the cpu_id until the value > reaches zero. If loop terminates before copy_of_cache_level reaching zero, -EINVAL. > Is that a good approach? Sounds like a try and hope for the best interface. How would userspace realistically infer the cache levels? I don't think we want vfio describing that. > > > + __u8 flags; > > > +#define VFIO_TPH_MEM_TYPE_MASK 0x1 > > > +#define VFIO_TPH_MEM_TYPE_SHIFT 0 > > > +#define VFIO_TPH_MEM_TYPE_VMEM 0 /* Request > > volatile memory ST */ > > > +#define VFIO_TPH_MEM_TYPE_PMEM 1 /* Request > > persistent memory ST */ > > > > Is there a relation to the cache_level here? Spec references here and below, > > assuming those are relevant to these values. > > > There is no relation to cache level. Return from the _DSM has PMEM and VMEM > sections. If either of those are not supported a valid bit is set to 0 in the return. > > PCIe spec has four processing-hints, VFIO_TPH_HINT_* are there to set them > in the flags field. Hints are forced to VFIO_TPH_HINT_BIDIR by current TPH > implementation as I described above. > > > > + > > > +#define VFIO_TPH_HINT_MASK 0x3 > > > +#define VFIO_TPH_HINT_SHIFT 1 > > > +#define VFIO_TPH_HINT_BIDIR 0 > > > +#define VFIO_TPH_HINT_REQSTR (1 << VFIO_TPH_HINT_SHIFT) > > > +#define VFIO_TPH_HINT_TARGET (2 << VFIO_TPH_HINT_SHIFT) > > > +#define VFIO_TPH_HINT_TARGET_PRIO (3 << VFIO_TPH_HINT_SHIFT) > > > > There needs to be a __u8 padding in here somewhere or flags extended to > > __u16. > > > > > + __u16 index; /* MSI-X/ST-table index to set ST */ > > > + /* out */ > > > + __u16 st; /* Steering-Tag */ > > > > Sorry if I'm just not familiar with TPH, but why do we need to return the ST? > > Doesn't hardware make use of the ST index and the physical value gets applied > > automatically in HW? > > > > Device specific mode requires this. For instance, intel E810 NIC can set > Steering-tags in queue context which is in user-space when used with DPDK. > For such use cases we must return steering-tags read from the _DSM. > Going back to DPDK RFC mentioned here, if TPH gets enabled in VFIO, the E810 > poll mode driver in the DPDK will ask the kernel a steering-tag for a combination > of a cpu-id, a cache-level and a processing-hint. In response the kernel will > invoke the ACPI _DSM for the root port of the device via pcie_tph_get_cpu_st() > and return the steering-tag back to the user. E810 driver will set the returned tag > in appropriate queue context. > > For cases where steering-tag is not required in user-space (VMMs), caller asks the > kernel to set steering-tag that corresponds the cpu-id, cache-level, and PH > combination at a given MSI-X vector entry or ST table in config space. For such > cases too, the kernel will read the steering-tag from the _DSM and set the tag > in the corresponding MSI-X or ST table entry. We need to consider that there are vfio-pci variant drivers that support migration and make use of the vfio-pci-core feature interface. TPH programming appears to be very non-migration friendly, the attributes are very specific to the current host system. Should migration and TPH be mutually exclusive? This may be a factor in the decision to use a feature or ioctl. > > > + __u8 ph_ignore; /* Processing hint was ignored by */ > > > > Padding/alignment, same as above. "ignored by..." By what? Is that an error? > > > > An error. "Processing hint was ignored by the platform" > > > > + __s32 err; /* Error on getting/setting Steering- > > Tag*/ > > > +}; > > > > Generally we'd expect a system call either works or fails. Why do we need per > > entry error report? Can't we validate and prepare the entire operation before > > setting any of it into the device? Ultimately we're just writing hints to config > > space or MSI-X table space, so the write operation itself is not likely to be the > > point of failure. > > > > Write to MSI-X won't fail but reading the steering-tag from the _DSM for an > invalid <cpu_id, cach_id, ph_hint> combo can fail in both GET_ST and SET_ST > cases. > We can change this to an all or nothing interface, such that success if all the > entries are valid and otherwise if at least one is invalid. But in that case the > caller may not know which entries were invalid. If you think that's ok, I can > change it. It's not exactly uncommon for an ioctl to fail for a single bad arg among many. I think it's harder for userspace to deal with recovering from a partially implemented call. > > > + > > > +struct vfio_pci_tph { > > > + __u32 argsz; /* Size of vfio_pci_tph and info[] */ > > > + __u32 flags; > > > +#define VFIO_DEVICE_FEATURE_TPH_OP_MASK 0x7 > > > +#define VFIO_DEVICE_FEATURE_TPH_OP_SHIFT 3 > > > +#define VFIO_DEVICE_FEATURE_TPH_ENABLE 0 /* > > Enable TPH on device */ > > > +#define VFIO_DEVICE_FEATURE_TPH_DISABLE 1 /* Disable TPH > > on device */ > > > +#define VFIO_DEVICE_FEATURE_TPH_GET_ST 2 /* Get > > steering-tags */ > > > +#define VFIO_DEVICE_FEATURE_TPH_SET_ST 4 /* Set > > steering-rags */ > > > > s/rags/tags/ > > > > vfio device features already have GET and SET as part of their base structure, why > > are they duplicated here? It really seems like there are two separate features > > here, one that allows enabling with a given mode or disable, and another that > > allows writing specific steering tags. Both seem like they could be SET-only > > features. It's also not clear to me that there's a significant advantage to providing > > an array of steering tags. Isn't updating STs an infrequent operation and likely > > bound to at most 2K tags in the case of MSI-X? > > > > VFIO_DEVICE_FEATURE_TPH_ENABLE and VFIO_DEVICE_FEATURE_TPH_DISABLE > are SET features. > Since, VFIO_DEVICE_FEATURE_TPH_SET_ST requires writing to ST table, so that too > falls under SET features. > Therefore, these flags are only valid when VFIO_DEVICE_FEATURE_SET flag is set. > The only GET features use case here is VFIO_DEVICE_FEATURE_TPH_GET_ST that > reads the steering-tags from the _DSM and returns it back to caller. Only valid > with VFIO_DEVICE_FEATURE_GET flag. > These are checked in vfio_pci_feature_tph_prepare(). > > As I mentioned earlier, does it make sense to leave enable/disable under VFIO > Feature and move GET_ST and SET_ST to a separate ioctl? Splitting TPH between a feature and new ioctls doesn't seem like a logical interface. > There isn't much rational to providing an array of tuples other than following > the format VFIO_DEVICE_SET_IRQS that sets eventfds. > > > > + > > > +#define VFIO_TPH_ST_MODE_MASK (0x3 << > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT) > > > +#define VFIO_TPH_ST_NS_MODE (0 << > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT) > > > +#define VFIO_TPH_ST_IV_MODE (1 << > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT) > > > +#define VFIO_TPH_ST_DS_MODE (2 << > > VFIO_DEVICE_FEATURE_TPH_OP_SHIFT) > > > + __u32 count; /* Number of entries in info[] > > */ > > > + struct vfio_pci_tph_info info[]; > > > +#define VFIO_TPH_INFO_MAX 64 /* Max entries allowed > > in info[] */ > > > > This seems to match the limit if the table is located in extended config space, but > > it's not particularly relevant if the table is located in MSI-X space. Why is this a > > good limit? > > Yes, for MSI-X the caller may have to invoke the ioctl multiple times if more > than 64 entries need to be updated. > > > > > Also, try to keep these all in 80 column. Thanks, > > > > I will, try to keep these under 80 columns. > > Thanks Alex, for your input. Let me know what you think about moving SET_ST > and GET_ST to an ioctl and keep ENABLE/DISABLE TPH as a vfio feature. It seems like kind of a hodgepodge to split the interface like that, imo. Sorry for a less that complete reply, I'm on PTO this week. Thanks, Alex