Hi Bjorn,

On Mon, Jun 19, 2017 at 10:32 PM, Srinath Mannam
<srinath.mannam@xxxxxxxxxxxx> wrote:
> Hi Bjorn,
>
> Thank you for the feedback.
>
> On Fri, Jun 16, 2017 at 3:57 AM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>> Hi Srinath,
>>
>> On Tue, May 30, 2017 at 02:38:17PM +0530, Srinath Mannam wrote:
>>> We found a concurrency issue in NVMe init when we initialize
>>> multiple NVMe devices connected over a PCIe switch.
>>>
>>> Setup details:
>>>  - SMP system with 8 ARMv8 cores running Linux kernel (4.11).
>>>  - Two NVMe cards are connected to the PCIe RC through a bridge as
>>>    shown in the figure below.
>>>
>>>                [RC]
>>>                 |
>>>             [BRIDGE]
>>>                 |
>>>            -----------
>>>            |         |
>>>         [NVMe]     [NVMe]
>>>
>>> Issue description:
>>> After PCIe enumeration completed, the NVMe driver probe function was
>>> called for both devices from two CPUs simultaneously.
>>> From nvme_probe, pci_enable_device_mem was called for both EPs. This
>>> function calls pci_enable_bridge recursively up to the RC.
>>
>> Let's refine the changelog a little bit by removing details that
>> aren't pertinent. The fact that this happens with NVMe on ARMv8 is
>> irrelevant. It could happen on any SMP system. The critical thing is
>> that drivers for two devices, both below the same disabled bridge,
>> called pci_enable_device() about the same time, and both tried to
>> enable the bridge simultaneously.
>>
> I will make the changelog generic.
>
>>> Inside the pci_enable_bridge function, a concurrency issue is
>>> observed at two places.
>>>
>>> Place 1:
>>> CPU 0:
>>>     1. Done atomic increment of dev->enable_cnt
>>>        in pci_enable_device_flags
>>>     2. Inside pci_enable_resources
>>>     3. Completed pci_read_config_word(dev, PCI_COMMAND, &cmd)
>>>     4. Ready to set PCI_COMMAND_MEMORY (0x2) in
>>>        pci_write_config_word(dev, PCI_COMMAND, cmd)
>>> CPU 1:
>>>     1. Check pci_is_enabled in function pci_enable_bridge,
>>>        and it is true
>>>     2. Check (!dev->is_busmaster), also true
>>>     3. Gone into pci_set_master
>>>     4. Completed pci_read_config_word(dev, PCI_COMMAND, &old_cmd)
>>>     5. Ready to set PCI_COMMAND_MASTER (0x4) in
>>>        pci_write_config_word(dev, PCI_COMMAND, cmd)
>>>
>>> At the last point, both CPUs have read the value 0 and are ready to
>>> write 2 and 4 respectively.
>>> After the last point, the final value in the PCI_COMMAND register is
>>> 4 instead of 6.
>>>
>>> Place 2:
>>> CPU 0:
>>>     1. Done atomic increment of dev->enable_cnt in
>>>        pci_enable_device_flags
>>>
>>> Signed-off-by: Srinath Mannam <srinath.mannam@xxxxxxxxxxxx>
>>> ---
>>> Changes since v1:
>>>  - Used mutex to synchronize pci_enable_bridge
>>>
>>>  drivers/pci/pci.c   | 4 ++++
>>>  drivers/pci/probe.c | 1 +
>>>  include/linux/pci.h | 1 +
>>>  3 files changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index b01bd5b..5bff3e7 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -1347,7 +1347,9 @@ static void pci_enable_bridge(struct pci_dev *dev)
>>>  {
>>>         struct pci_dev *bridge;
>>>         int retval;
>>> +       struct mutex *lock = &dev->bridge_lock;
>>>
>>> +       mutex_lock(lock);
>>
>> I don't think it's necessary to hold the lock until we call
>> pci_set_master() or pci_enable_device(), is it? E.g., we shouldn't
>> need to hold the lock for "dev" while we call pci_enable_bridge() for
>> its upstream bridge.
> We see the issue because pci_set_master() and pci_enable_device()
> are not synchronized, so we must hold the lock across both
> pci_set_master() and pci_enable_device().
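
To make that interleaving concrete, here is a simplified sketch of the two
read-modify-write sequences that race on the bridge's PCI_COMMAND register.
It is illustrative only, condensed from what pci_enable_resources() and
pci_set_master() do, not the exact kernel code; "dev" stands for the shared
bridge's struct pci_dev.

	u16 cmd, old_cmd;

	/* CPU 0: pci_enable_device() -> pci_enable_resources() */
	pci_read_config_word(dev, PCI_COMMAND, &cmd);      /* reads 0x0 */
	cmd |= PCI_COMMAND_MEMORY;                          /* wants to write 0x2 */
	pci_write_config_word(dev, PCI_COMMAND, cmd);

	/* CPU 1: pci_enable_bridge() sees pci_is_enabled() -> pci_set_master() */
	pci_read_config_word(dev, PCI_COMMAND, &old_cmd);   /* also reads 0x0 */
	cmd = old_cmd | PCI_COMMAND_MASTER;                 /* wants to write 0x4 */
	pci_write_config_word(dev, PCI_COMMAND, cmd);

	/*
	 * Whichever write lands last wins, so PCI_COMMAND ends up as 0x2 or
	 * 0x4 instead of the expected 0x6.
	 */

With a single lock held across the whole enable path, the second CPU re-reads
PCI_COMMAND only after the first CPU's write, so the bits accumulate instead
of overwriting each other.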
> As per your suggestion on the previous patch, avoiding the concurrency
> with a mutex in struct pci_host_bridge causes a deadlock, because
> pci_enable_device_flags is called recursively through pci_host_bridge.
> So I have taken a separate lock for each PCIe bridge. When a bridge is
> shared, only that bridge's lock is contended by more than one CPU.
>
> We can also take the lock after the pci_enable_bridge function call.
>
>              [BR0]   (Host bridge)
>                |
>              [BR1]
>                |
>              [BR2]
>                |
>        -----------------
>        |               |
>      [BR3]           [BR4]
>        |               |
>     [DEV1]           [DEV2]
>
> In the above case, the lock is contended by both CPUs only at BR2, BR1
> and BR0, because those are common to both devices. If we take the lock
> before pci_enable_bridge, one CPU (say CPU0) gets the lock and CPU1
> waits until CPU0 has completed pci_set_master() and pci_enable_device()
> for all of the remaining common bridges.
> If we take the lock after pci_enable_bridge, both CPUs traverse up to
> BR0. One CPU (say CPU0) gets the lock and CPU1 waits until that bridge
> is enabled. After that, both CPUs return to pci_enable_bridge to enable
> BR1; CPU0 takes the lock and CPU1 waits, and the same happens again at
> BR2. In this case CPU1 does a little more work, but otherwise
> everything is fine.
>
>>
>>>         bridge = pci_upstream_bridge(dev);
>>>         if (bridge)
>>>                 pci_enable_bridge(bridge);
>>> @@ -1355,6 +1357,7 @@ static void pci_enable_bridge(struct pci_dev *dev)
>>>         if (pci_is_enabled(dev)) {
>>>                 if (!dev->is_busmaster)
>>>                         pci_set_master(dev);
>>> +               mutex_unlock(lock);
>>
>> It's not a big deal either way, but I probably would write this with a
>> single unlock at the end and a goto here.
> I will add a goto here.
>>
>>>                 return;
>>>         }
>>>
>>> @@ -1363,6 +1366,7 @@ static void pci_enable_bridge(struct pci_dev *dev)
>>>                 dev_err(&dev->dev, "Error enabling bridge (%d), continuing\n",
>>>                         retval);
>>>         pci_set_master(dev);
>>> +       mutex_unlock(lock);
>>>  }
>>>
>>>  static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>> index 19c8950..1c25d1c 100644
>>> --- a/drivers/pci/probe.c
>>> +++ b/drivers/pci/probe.c
>>> @@ -880,6 +880,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
>>>         child->dev.parent = child->bridge;
>>>         pci_set_bus_of_node(child);
>>>         pci_set_bus_speed(child);
>>> +       mutex_init(&bridge->bridge_lock);
>>>
>>>         /* Set up default resource pointers and names.. */
>>>         for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>> index 33c2b0b..7e88f41 100644
>>> --- a/include/linux/pci.h
>>> +++ b/include/linux/pci.h
>>> @@ -266,6 +266,7 @@ struct pci_dev {
>>>         void            *sysdata;       /* hook for sys-specific extension */
>>>         struct proc_dir_entry *procent; /* device entry in /proc/bus/pci */
>>>         struct pci_slot *slot;          /* Physical slot this device is in */
>>> +       struct mutex    bridge_lock;
>>
>> I don't really like adding a per-device lock just for this unusual
>> case. Can you use the existing device_lock() instead?
>>
> Yes, using device_lock is a good idea.
> With this lock we don't have nested-locking issues in this context.
> I will update the code.

After taking device_lock in pci_enable_bridge, no other callee takes a
nested lock. But pci_enable_bridge is called in the context of a driver
probe function, and then we have a nested-lock problem: a few PCIe
drivers call pci_enable_device from their probe routine, and in that
context device_lock is already held by the caller of the driver probe.
My previous comments were based on tests with drivers that do not call
pci_enable_device from probe.
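
For reference, below is roughly how I understand the device_lock() variant of
pci_enable_bridge() would look, with the single unlock at the end via goto as
you suggested. This is an untested, illustrative sketch, not the actual next
revision; the open question is whether the device_lock() taken here nests
safely under the device_lock() that the driver core already holds around the
device being probed.

	static void pci_enable_bridge(struct pci_dev *dev)
	{
		struct pci_dev *bridge;
		int retval;

		/* Enable the whole chain of upstream bridges first. */
		bridge = pci_upstream_bridge(dev);
		if (bridge)
			pci_enable_bridge(bridge);

		/* Serialize enable and set-master for this bridge. */
		device_lock(&dev->dev);

		if (pci_is_enabled(dev)) {
			if (!dev->is_busmaster)
				pci_set_master(dev);
			goto out;
		}

		retval = pci_enable_device(dev);
		if (retval)
			dev_err(&dev->dev, "Error enabling bridge (%d), continuing\n",
				retval);
		pci_set_master(dev);
	out:
		device_unlock(&dev->dev);
	}

If that nesting turns out to be a problem for drivers that call
pci_enable_device() from probe, we may need a lock that is not tied to
struct device, which is what the per-bridge mutex in this patch was trying
to provide.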
>
>>>         unsigned int    devfn;          /* encoded device & function index */
>>>         unsigned short  vendor;
>>> --
>>> 2.7.4
>>>
>
> Regards,
> srinath

Regards,
Srinath.