Re: [RFC PATCH v2] pci: Concurrency issue in NVMe Init through PCIe switch

Srinath Mannam <srinath.mannam@xxxxxxxxxxxx> · Fri, 23 Jun 2017 21:05:24 +0530

Hi Bjorn,

On Mon, Jun 19, 2017 at 10:32 PM, Srinath Mannam
<srinath.mannam@xxxxxxxxxxxx> wrote:
> Hi Bjorn,
>
> Thank you for the feedback.
>
> On Fri, Jun 16, 2017 at 3:57 AM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>> Hi Srinath,
>>
>> On Tue, May 30, 2017 at 02:38:17PM +0530, Srinath Mannam wrote:
>>> We found a concurrency issue in NVMe Init when we initialize
>>> multiple NVMe connected over PCIe switch.
>>>
>>> Setup details:
>>>  - SMP system with 8 ARMv8 cores running Linux kernel(4.11).
>>>  - Two NVMe cards are connected to PCIe RC through bridge as shown
>>>    in the below figure.
>>>
>>>                    [RC]
>>>                     |
>>>                  [BRIDGE]
>>>                     |
>>>                -----------
>>>               |           |
>>>             [NVMe]      [NVMe]
>>>
>>> Issue description:
>>> After PCIe enumeration completed NVMe driver probe function called
>>> for both the devices from two CPUS simultaneously.
>>> From nvme_probe, pci_enable_device_mem called for both the EPs. This
>>> function called pci_enable_bridge enable recursively until RC.
>>
>> Let's refine the changelog a little bit by removing details that
>> aren't pertinent.  The fact that this happens with NVMe on ARMv8 is
>> irrelevant.  It could happen on any SMP system.  The critical thing is
>> that drivers for two devices, both below the same disabled bridge,
>> called pci_enable_device() about the same time, and both tried to
>> enable the bridge simultaneously.
>>
> I will modify the changelog as generic.
>
>>> Inside pci_enable_bridge function, at two places concurrency issue is
>>> observed.
>>>
>>> Place 1:
>>>   CPU 0:
>>>     1. Done Atomic increment dev->enable_cnt
>>>        in pci_enable_device_flags
>>>     2. Inside pci_enable_resources
>>>     3. Completed pci_read_config_word(dev, PCI_COMMAND, &cmd)
>>>     4. Ready to set PCI_COMMAND_MEMORY (0x2) in
>>>        pci_write_config_word(dev, PCI_COMMAND, cmd)
>>>   CPU 1:
>>>     1. Check pci_is_enabled in function pci_enable_bridge
>>>        and it is true
>>>     2. Check (!dev->is_busmaster) also true
>>>     3. Gone into pci_set_master
>>>     4. Completed pci_read_config_word(dev, PCI_COMMAND, &old_cmd)
>>>     5. Ready to set PCI_COMMAND_MASTER (0x4) in
>>>        pci_write_config_word(dev, PCI_COMMAND, cmd)
>>>
>>> By the time of last point for both the CPUs are read value 0 and
>>> ready to write 2 and 4.
>>> After last point final value in PCI_COMMAND register is 4 instead of 6.
>>>
>>> Place 2:
>>>   CPU 0:
>>>     1. Done Atomic increment dev->enable_cnt in
>>>        pci_enable_device_flags
>>>
>>> Signed-off-by: Srinath Mannam <srinath.mannam@xxxxxxxxxxxx>
>>> ---
>>> Changes since v1:
>>>  - Used mutex to syncronize pci_enable_bridge
>>>
>>>  drivers/pci/pci.c   | 4 ++++
>>>  drivers/pci/probe.c | 1 +
>>>  include/linux/pci.h | 1 +
>>>  3 files changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index b01bd5b..5bff3e7 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -1347,7 +1347,9 @@ static void pci_enable_bridge(struct pci_dev *dev)
>>>  {
>>>       struct pci_dev *bridge;
>>>       int retval;
>>> +     struct mutex *lock = &dev->bridge_lock;
>>>
>>> +     mutex_lock(lock);
>>
>> I don't think it's necessary to hold the lock until we call
>> pci_set_master() or pci_enable_device(), is it?  E.g., we shouldn't
>> need to hold the lock for "dev" while we call pci_enable_bridge() for
>> its upstream bridge.
> We see the issue because of pci_set_master() and pci_enable_device()
> are not syncronous. So we must take the lock for both pci_set_master() and
> pci_enable_device(). As per your suggestion given in previous patch to avoid
> concurrency using mutex in the struct pci_host_bridge causes dead lock,
> because pci_enable_device_flags is calling recursively through pci_host_bridge.
> So I have taken separate lock for each pcie bridge. In the case of common bridge
> enable only this bridge lock will be taken by more than one CPUs.
>
> We can take lock after pci_enable_bridge function call also.
>
>              [BR0] (Host bridge)
>                  |
>               [BR1]
>                  |
>               [BR2]
>                 |
>        -----------------
>       |                    |
>   [BR3]               [BR4]
>      |                     |
>   [DEV1]           [DEV2]
>
> In the above case, lock tried by both cpus only at BR2, BR1 and BR0 because
> they are common for both the devices. If we take the lock before
> pci_enable_bridge
> one CPU say CPU0 get the lock and CPU1 is waiting until first CPU completes
> all the pci_set_master() and pci_enable_device() completed for all the remaining
> common bridges.
> If we take lock after pci_enable_bridge both CPUs traverse until BR0.
> one CPU say
> CPU0 get the lock CPU1 wait for the lock until bridge initialization completes.
> after bridge enable both CPUs return back to pci_enable_bridge function
> to enable BR1, then CPU0 will take the lock CPU1 will wait and both CPUs
> return to BR2 then CPU0 will take the lock and CPU1 will wait for the lock.
> this case little CPU1 execution is more except everything is fine.
>
>>
>>>       bridge = pci_upstream_bridge(dev);
>>>       if (bridge)
>>>               pci_enable_bridge(bridge);
>>> @@ -1355,6 +1357,7 @@ static void pci_enable_bridge(struct pci_dev *dev)
>>>       if (pci_is_enabled(dev)) {
>>>               if (!dev->is_busmaster)
>>>                       pci_set_master(dev);
>>> +             mutex_unlock(lock);
>>
>> It's not a big deal either way, but I probably would write this with a
>> single unlock at the end and a goto here.
> I will add goto here.
>>
>>>               return;
>>>       }
>>>
>>> @@ -1363,6 +1366,7 @@ static void pci_enable_bridge(struct pci_dev *dev)
>>>               dev_err(&dev->dev, "Error enabling bridge (%d), continuing\n",
>>>                       retval);
>>>       pci_set_master(dev);
>>> +     mutex_unlock(lock);
>>>  }
>>>
>>>  static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>> index 19c8950..1c25d1c 100644
>>> --- a/drivers/pci/probe.c
>>> +++ b/drivers/pci/probe.c
>>> @@ -880,6 +880,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
>>>       child->dev.parent = child->bridge;
>>>       pci_set_bus_of_node(child);
>>>       pci_set_bus_speed(child);
>>> +     mutex_init(&bridge->bridge_lock);
>>>
>>>       /* Set up default resource pointers and names.. */
>>>       for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>> index 33c2b0b..7e88f41 100644
>>> --- a/include/linux/pci.h
>>> +++ b/include/linux/pci.h
>>> @@ -266,6 +266,7 @@ struct pci_dev {
>>>       void            *sysdata;       /* hook for sys-specific extension */
>>>       struct proc_dir_entry *procent; /* device entry in /proc/bus/pci */
>>>       struct pci_slot *slot;          /* Physical slot this device is in */
>>> +     struct mutex bridge_lock;
>>
>> I don't really like adding a per-device lock just for this unusual
>> case.  Can you use the existing device_lock() instead?
>>
> Yes using device_lock is good idea.
> with this lock we don't have issues of nexted locking in this context.
> I will update the code.
After taking device_lock in the pci_enable_bridge function no other
callee functions
is taking nexted lock.
But the pci_enable_bridge is called in the context of the driver probe function,
we will have nexted lock problem.

few pcie drivers called pci_enable_device function from driver probe.
In this context device_lock is already taken by the driver probe
caller function.
My previous mail comments were based on the tests in the drivers where
pci_enable_device function not called in driver probe.

>
>>>       unsigned int    devfn;          /* encoded device & function index */
>>>       unsigned short  vendor;
>>> --
>>> 2.7.4
>>>
>
> Regards,
> srinath
Regards,
Srinath.