On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > vfio will hold device lock and notify userspace of the removal. If > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > same thread is responsible for releasing the device to vfio, it results in > > > > a deadlock. > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > kernel. > > The lock was not about detecting a change; Pierre did this: > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > cat ${path}/device/sriov_numvfs; \ > > which I assume works by listening for sysfs events. It is not, "ip monitor ..." listens to netlink events emitted by netdev core and not sysfs events. Sysfs events are not involved in this case. > The problem was that after the event occurred, the sriov_numvfs > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). Yes, and it is outcome of such cross-subsytem involvement, which is racy by definition. Someone can come with even simpler example of why locking sysfs read and write is not a good idea. For example, let's consider the following scenario with two CPUs and locks on sysfs read and write: CPU1 CPU2 echo 1 > ${path}/device/sriov_numvfs context_switch -> cat ${path}/device/sriov_numvfs lock return 0 unlock context_switch <- lock set 1 unlock CPU1 CPU2 echo 1 > ${path}/device/sriov_numvfs lock set 1 unlock context_switch -> cat ${path}/device/sriov_numvfs lock return 1 unlock So same scenario will return different values if user doesn't protect such case with external to the kernel lock. But if we return back to Pierre report and if you want to provide completely bullet proof solution to solve cross-subsystem interaction, you will need to prohibit device probe till sriov_numvfs update is completed. However, it is overkill for something that is not a real issue. > > So I would drop this sentence because I don't think it accurately > reflects the reason for 35ff867b7657. > > > > Since you are reverting a commit that synchronizes SysFS read > > > /write, please add some comments about why it is not an > > > issue anymore. > > > > It was never an issue, the idea that sysfs read and write should be > > serialized by kernel is not correct by definition. > > I think it *was* an issue. The behavior Pierre observed at was > clearly wrong, I disagree with this sentence. > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > sriov_numvfs reads vs writes") to resolve it. > > We should try to avoid reintroducing the problem, so I think we should > probably squash these two patches and describe it as a deadlock fix > instead of dismissing 35ff867b7657 as being based on false premises. > > It would be awesome if you had time to verify that these patches also > resolve the problem you saw, Pierre. They won't resolve his problem, because he is not listening to sysfs events, but rely on something from netdev side. > > I think we should also add: > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > as a trigger for backporting this to kernels that include > 35ff867b7657. > > Bjorn > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > Suggested-by: Leon Romanovsky <leonro@xxxxxxxxxx> > > > > Reviewed-by: Leon Romanovsky <leonro@xxxxxxxxxx> > > > > Signed-off-by: Jim Harris <jim.harris@xxxxxxxxxxx> > > > > --- > > > > drivers/pci/iov.c | 8 +------- > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > --- a/drivers/pci/iov.c > > > > +++ b/drivers/pci/iov.c > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > char *buf) > > > > { > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > - u16 num_vfs; > > > > - > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > - device_lock(&pdev->dev); > > > > - num_vfs = pdev->sriov->num_VFs; > > > > - device_unlock(&pdev->dev); > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > } > > > > > > > > /* > > > > > > > -- > > > Sathyanarayanan Kuppuswamy > > > Linux Kernel Developer > > > >