[+cc Rafael, in case you're interested in the driver core issue here] On Wed, Mar 31, 2021 at 07:08:07AM +0300, Leon Romanovsky wrote: > On Tue, Mar 30, 2021 at 03:41:41PM -0500, Bjorn Helgaas wrote: > > On Tue, Mar 30, 2021 at 04:47:16PM -0300, Jason Gunthorpe wrote: > > > On Tue, Mar 30, 2021 at 10:00:19AM -0500, Bjorn Helgaas wrote: > > > > On Tue, Mar 30, 2021 at 10:57:38AM -0300, Jason Gunthorpe wrote: > > > > > On Mon, Mar 29, 2021 at 08:29:49PM -0500, Bjorn Helgaas wrote: > > > > > > > > > > > I think I misunderstood Greg's subdirectory comment. We already have > > > > > > directories like this: > > > > > > > > > > Yes, IIRC, Greg's remark applies if you have to start creating > > > > > directories with manual kobjects. > > > > > > > > > > > and aspm_ctrl_attr_group (for "link") is nicely done with static > > > > > > attributes. So I think we could do something like this: > > > > > > > > > > > > /sys/bus/pci/devices/0000:01:00.0/ # PF directory > > > > > > sriov/ # SR-IOV related stuff > > > > > > vf_total_msix > > > > > > vf_msix_count_BB:DD.F # includes bus/dev/fn of first VF > > > > > > ... > > > > > > vf_msix_count_BB:DD.F # includes bus/dev/fn of last VF > > > > > > > > > > It looks a bit odd that it isn't a subdirectory, but this seems > > > > > reasonable. > > > > > > > > Sorry, I missed your point; you'll have to lay it out more explicitly. > > > > I did intend that "sriov" *is* a subdirectory of the 0000:01:00.0 > > > > directory. The full paths would be: > > > > > > > > /sys/bus/pci/devices/0000:01:00.0/sriov/vf_total_msix > > > > /sys/bus/pci/devices/0000:01:00.0/sriov/vf_msix_count_BB:DD.F > > > > ... > > > > > > Sorry, I was meaning what you first proposed: > > > > > > /sys/bus/pci/devices/0000:01:00.0/sriov/BB:DD.F/vf_msix_count > > > > > > Which has the extra sub directory to organize the child VFs. > > > > > > Keep in mind there is going to be alot of VFs here, > 1k - so this > > > will be a huge directory. > > > > With 0000:01:00.0/sriov/vf_msix_count_BB:DD.F, sriov/ will contain > > 1 + 1K files ("vf_total_msix" + 1 per VF). > > > > With 0000:01:00.0/sriov/BB:DD.F/vf_msix_count, sriov/ will contain > > 1 file and 1K subdirectories. > > This is racy by design, in order to add new file and create BB:DD.F > directory, the VF will need to do it after or during it's creation. > During PF creation it is unknown to PF those BB:DD.F values. > > The race here is due to the events of PF,VF directory already sent > but new directory structure is not ready yet. > > From code perspective, we will need to add something similar to > pci_iov_sysfs_link() with the code that you didn't like in previous > variants (the one that messes with sysfs_create_file API). > > It looks not good for large SR-IOV systems with >1K VFs with > gazillion subdirectories inside PF, while the more natural is to see > them in VF. > > So I'm completely puzzled why you want to do these files on PF and > not on VF as v0, v7 and v8 proposed. On both mlx5 and NVMe, the "assign vectors to VF" functionality is implemented by the PF, so I think it's reasonable to explore the idea of "associate the vector assignment sysfs file with the PF." Assume 1K VFs. Either way we have >1K subdirectories of /sys/devices/pci0000:00/. I think we should avoid an extra subdirectory level, so I think the choices on the table are: Associate "vf_msix_count" with the PF: - /sys/.../<PF>/sriov/vf_total_msix # all on PF - /sys/.../<PF>/sriov/vf_msix_count_BB:DD.F (1K of these). Greg says the number of these is not a problem. - The "vf_total_msix" and "vf_msix_count_*" files are all related and are grouped together in PF/sriov/. - The "vf_msix_count_*" files operate directly on the PF. Lock the PF for serialization, lookup and lock the VF to ensure no VF driver, call PF driver callback to assign vectors. - Requires special sysfs code to create/remove "vf_msix_count_*" files when setting/clearing VF Enable. This code could create them only when the PF driver actually supports vector assignment. Unavoidable sysfs/uevent race, see below. Associate "vf_msix_count" with the VF: - /sys/.../<PF>/sriov_vf_total_msix # on PF - /sys/.../<VF>/sriov_vf_msix_count # on each VF - The "sriov_vf_msix_count" files enter via the VF. Lock the VF to ensure no VF driver, lookup and lock the PF for serialization, call PF driver callback to assign vectors. - Can be done with static sysfs attributes. This means creating "sriov_vf_msix_count" *always*, even if PF driver doesn't support vector assignment. IIUC, putting "vf_msix_count_*" under the PF involves a race. When we call device_add() for each new VF, it creates the VF sysfs directory and emits the KOBJ_ADD uevent, but the "vf_msix_count_*" file doesn't exist yet. It can't be created before device_add() because the sysfs directory doesn't exist. If we create it after device_add(), the "add VF" uevent has already been emitted, so userspace may consume it before "vf_msix_count_*" is created. sriov_enable <set VF Enable> <-- VFs created on PCI sriov_add_vfs for (i = 0; i < num_vfs; i++) { pci_iov_add_virtfn pci_device_add device_initialize device_add device_add_attrs <-- add VF sysfs attrs kobject_uevent(KOBJ_ADD) <-- emit uevent <-- add "vf_msix_count_*" sysfs attr pci_iov_sysfs_link pci_bus_add_device pci_create_sysfs_dev_files device_attach } Conceptually, I like having the "vf_total_msix" and "vf_msix_count_*" files associated directly with the PF. I think that's more natural because they both operate directly on the PF. But I don't like the race, and using static attributes seems much cleaner implementation-wise. Bjorn