> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> > >> The NVMe driver uses threads for the work done at device reset, including
>> > >> enabling the PCIe device. When multiple NVMe devices are initialized, their
>> > >> reset works may be scheduled in parallel. Then pci_enable_device_mem() can
>> > >> be called in parallel on multiple cores.
>> > >>
>> > >> This causes a loop enabling all upstream bridges in pci_enable_bridge().
>> > >> pci_enable_bridge() performs multiple operations, including __pci_set_master()
>> > >> and architecture-specific functions that call ones like pci_enable_resources().
>> > >> Both __pci_set_master() and pci_enable_resources() read the PCI_COMMAND
>> > >> field in the PCIe config space and change it. This is done as
>> > >> read/modify/write.
>> > >>
>> > >> Imagine that the PCIe tree looks like:
>> > >> A - B - switch - C - D
>> > >>                \- E - F
>> > >>
>> > >> D and F are two NVMe disks, and all devices from B down are not enabled
>> > >> and bus mastering is not set. If their reset works are scheduled in
>> > >> parallel, the two modifications of PCI_COMMAND may happen in parallel
>> > >> without locking, and the system may end up with part of the PCIe tree
>> > >> not enabled.
>> > >
>> > > Then it looks like serialized reset should be used, and I did see the commit
>> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fix an issue of 'failed
>> > > to mark controller state' in a reset stress test.
>> > >
>> > > But that commit only covers the case of PCI reset from the sysfs attribute,
>> > > and maybe other cases need to be dealt with in a similar way too.
>> >
>> > It seems to me that the serialized reset works for multiple resets of the
>> > same device, doesn't it? Our problem is linked to resets of different devices
>> > that share the same PCIe tree.
>>
>> Given that reset shouldn't be a frequent action, it might be fine to serialize
>> all resets from different devices.
>
> The driver was much simpler when we had serialized resets in line with
> probe, but that had bigger problems with certain init systems when
> you put enough nvme devices in your server, making them unbootable.
>
> Would it be okay to serialize just the pci_enable_device across all
> other tasks messing with the PCI topology?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index cef5ce851a92..e0a2f6c0f1cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  	int result = -ENOMEM;
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>  
> -	if (pci_enable_device_mem(pdev))
> -		return result;
> +	pci_lock_rescan_remove();
> +	result = pci_enable_device_mem(pdev);
> +	pci_unlock_rescan_remove();
> +	if (result)
> +		return -ENODEV;
>  
>  	pci_set_master(pdev);
> 

The problem may also happen with another device doing its probe while nvme is
running its workqueue (and we have probably seen it in practice too). We were
thinking about a lock in the PCI generic code too; that's why I've put the
linux-pci@ list in copy.

Marta