Re: [PATCH 1/2] PCI: ASPM exit link state code could skip devices

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Feb 24, 2013 at 10:59 PM, Gu Zheng <guz.fnst@xxxxxxxxxxxxxx> wrote:
> On 02/24/2013 08:20 AM, Bjorn Helgaas wrote:
>
>> [+cc Yinghai]
>>
>> On Mon, Feb 18, 2013 at 8:33 PM, Gu Zheng <guz.fnst@xxxxxxxxxxxxxx> wrote:
>>> Resend mail for the network problem, if you have received, please ignore it.
>>>
>>> On 02/09/2013 08:35 AM, Bjorn Helgaas wrote:
>>>
>>>> On Wed, Feb 6, 2013 at 3:09 AM, Gu Zheng <guz.fnst@xxxxxxxxxxxxxx> wrote:
>>>>> On 01/19/2013 02:23 AM, Joe Lawrence wrote:
>>>>>
>>>>>> >From 3a51bbad5555ba6c6e144aa5176c8112eb449325 Mon Sep 17 00:00:00 2001
>>>>>> From: Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
>>>>>> Date: Tue, 15 Jan 2013 14:51:57 -0500
>>>>>> Subject: [PATCH 1/2] PCI: ASPM exit link state code could skip devices
>>>>>>
>>>>>> On PCI bus hotplug removal, pcie_aspm_exit_link_state could potentially skip
>>>>>> parent devices that have link_state allocated.  Instead of exiting early if
>>>>>> a given device is not PCIe, check that the device's parent is PCIe.  This
>>>>>> enables pcie_aspm_exit_link_state to properly clean up parent link_state when
>>>>>> the last function in a slot might not be PCIe.
>>>>>>
>>>>>> Reviewed-by: David Bulkow <david.bulkow@xxxxxxxxxxx>
>>>>>> Signed-off-by: Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
>>>>>> ---
>>>>>>  drivers/pci/pcie/aspm.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
>>>>>> index b52630b..6122447 100644
>>>>>> --- a/drivers/pci/pcie/aspm.c
>>>>>> +++ b/drivers/pci/pcie/aspm.c
>>>>>> @@ -634,7 +634,7 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
>>>>>>       struct pci_dev *parent = pdev->bus->self;
>>>>>>       struct pcie_link_state *link, *root, *parent_link;
>>>>>>
>>>>>> -     if (!pci_is_pcie(pdev) || !parent || !parent->link_state)
>>>>>> +     if (!parent || !pci_is_pcie(parent) || !parent->link_state)
>>>>>>               return;
>>>>>>       if ((pci_pcie_type(parent) != PCI_EXP_TYPE_ROOT_PORT) &&
>>>>>>           (pci_pcie_type(parent) != PCI_EXP_TYPE_DOWNSTREAM))
>>>>>
>>>>> Hi Joe,
>>>>>       We encounter the issue you mentioned above, so I used your patch to try to
>>>>> fix it, but it seems not helpful. Could you please help to look into it ?
>>>
>>>
>>> Hi Bjorn,
>>>         Sorry for the later reply, we were on vacation in the last week.
>>>
>>>>
>>>> Please try this with a current linux-next tree or at least a tree with
>>>> "PCI/ASPM: Deallocate upstream link state even if device is not PCIe"
>>>> in it, e.g., http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=shortlog;h=refs/heads/next
>>>>
>>>
>>>
>>> OK, I'll take it when my test machine is OK.
>>>
>>>> Can you collect  the complete "lspci -vv" output as root?
>>>
>>>
>>> Sure, my test machine has a lot of hardware, the info is send as an attachment.
>>>
>>>>
>>>> Do you need both "remove" actions in your test script to cause the
>>>> crash?  Removing 10:00.0 should remove the whole tree below it,
>>>> including 1a:01.0, so I would think the 1a:01.0 remove would just
>>>> fail.
>>>
>>>
>>> The test script is used to test the parallel remove routines trigged by sysfs/pci interface.
>>> So the issue we mentioned is a extra discovery.
>>> As you said, when the front part script "echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove"
>>> runs over, the whole tree below 10:00.0 should be removed already, so the latter part script "
>>> echo -n 1 >  /sys/bus/pci/devices/0000\:1a\:01.0/remove" wants to remove 1a:01.0 would fail.
>>> But it does not work as we said indeed, it also enters the remove routine, and rises a fault.
>>>
>>> I print some debug info when run the test script. When the latter part remove routine works into
>>> pcie_aspm_exit_link_state(), the *parent* is set to 0x6b6b6b6b6b6b6b6b, POISON_FREE, because of
>>> the front remove routine has free it. So the check "if (!parent || !parent->link_state)" will rise
>>> a fault. Maybe we need to add a check whether *parent* is freed.
>>>
>>> Thanks,
>>> Gu
>>>
>>>>
>>>> Also, if you can do "disassemble pcie_aspm_exit_link_state" in gdb, we
>>>> can figure out exactly where the fault is.
>>>>
>>>> It might be easiest to open a bugzilla and attach the complete dmesg
>>>> log, lspci output, disassembly, etc., there.
>>>
>>>
>>> OK, I'll create a bugzilla later.
>>>
>>>>
>>>>> *test script*
>>>>> echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >  /sys/bus/pci/devices/0000\:1a\:01.0/remove
>>>>>
>>>>> *pci topology tree*
>>>>> +-09.0-[10-1e]----00.0-[11-1e]--+-00.0-[12-18]----00.0-[13-18]--+-00.0-[14]--+-00.0
>>>>>              |                               |                               |            \-00.1
>>>>>              |                               |                               +-01.0-[15]--+-00.0
>>>>>              |                               |                               |            \-00.1
>>>>>              |                               |                               +-02.0-[16]----00.0
>>>>>              |                               |                               +-03.0-[17]----00.0
>>>>>              |                               |                               \-04.0-[18]--
>>>>>              |                               \-01.0-[19-1e]----00.0-[1a-1e]--+-00.0-[1b]--
>>>>>              |                                                               +-01.0-[1c]--+-00.0
>>>>>              |                                                               |            \-00.1
>>>>>              |                                                               +-02.0-[1d]--
>>>>>              |                                                               \-03.0-[1e]--
>>>>>
>>>>> $ lspci -vs 10:00.0
>>>>> 10:00.0 PCI bridge: Integrated Device Technology, Inc. Device 807f (rev 02) (prog-if 00 [Normal decode])
>>>>>         Flags: bus master, fast devsel, latency 0
>>>>>         Bus: primary=10, secondary=11, subordinate=1e, sec-latency=0
>>>>>         I/O behind bridge: 00001000-00005fff
>>>>>         Memory behind bridge: 92a00000-937fffff
>>>>>         Prefetchable memory behind bridge: 0000000092200000-00000000929fffff
>>>>>         Capabilities: <access denied>
>>>>>         Kernel driver in use: pcieport
>>>>>         Kernel modules: shpchp
>>>>>
>>>>> $ lspci -vs 1a:01.0
>>>>> 1a:01.0 PCI bridge: Integrated Device Technology, Inc. Device 807f (rev 02) (prog-if 00 [Normal decode])
>>>>>         Flags: bus master, fast devsel, latency 0
>>>>>         Bus: primary=1a, secondary=1c, subordinate=1c, sec-latency=0
>>>>>         I/O behind bridge: 00001000-00001fff
>>>>>         Memory behind bridge: 92e00000-930fffff
>>>>>         Prefetchable memory behind bridge: 0000000092600000-00000000927fffff
>>>>>         Capabilities: <access denied>
>>>>>         Kernel driver in use: pcieport
>>>>>         Kernel modules: shpchp
>>>>>
>>>>> *dmesg info*
>>>>> [  328.037479] general protection fault: 0000 [#1] SMP
>>>>> [  328.096991] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr lpc_ich mfd_core sg i2c_i801 i2c_core ioatdma i7core_edac edac_core e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
>>>>> [  328.697122] CPU 6
>>>>> [  328.719040] Pid: 6, comm: kworker/u:0 Tainted: G        W    3.8.0-rc6-aspm-pcie-fix+ #58 FUJITSU-SV PRIMEQUEST 1800E/SB
>>>>> [  328.851117] RIP: 0010:[<ffffffff813928f8>]  [<ffffffff813928f8>] pcie_aspm_exit_link_state+0x38/0x190
>>>>> [  328.961428] RSP: 0018:ffff8807bde17c48  EFLAGS: 00010202
>>>>> [  329.024874] RAX: ffff8807bb4a1290 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000006
>>>>> [  329.110125] RDX: 0000000000000006 RSI: ffff8807bde1afc8 RDI: 0000000000000246
>>>>> [  329.195371] RBP: ffff8807bde17c68 R08: 0000000000000001 R09: 0000000000000001
>>>>> [  329.280619] R10: 0000000000000003 R11: 0000000000000001 R12: ffff8807bb49b3d8
>>>>> [  329.365869] R13: ffff8807bb49b3d8 R14: ffffffff82126d80 R15: ffff8807bde17d58
>>>>> [  329.451127] FS:  0000000000000000(0000) GS:ffff8807c2600000(0000) knlGS:0000000000000000
>>>>> [  329.547796] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>> [  329.616431] CR2: ffffffffff600400 CR3: 0000000001c0c000 CR4: 00000000000007e0
>>>>> [  329.701687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> [  329.786935] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>> [  329.872185] Process kworker/u:0 (pid: 6, threadinfo ffff8807bde16000, task ffff8807bde1a680)
>>>>> [  329.973006] Stack:
>>>>> [  329.997000]  0000000000000006 ffff8807bb49b3d8 0000000000000000 ffff8807bb49b3d8
>>>>> [  330.085822]  ffff8807bde17c88 ffffffff81380f42 2222222222222222 ffff8807bb49b3d8
>>>>> [  330.174627]  ffff8807bde17cb8 ffffffff81380fb4 0000000000000000 ffff8807bb49b3d8
>>>>> [  330.263427] Call Trace:
>>>>> [  330.292616]  [<ffffffff81380f42>] pci_stop_dev+0xb2/0xd0
>>>>> [  330.356064]  [<ffffffff81380fb4>] pci_stop_bus_device+0x54/0x60
>>>>> [  330.426778]  [<ffffffff81381156>] pci_stop_and_remove_bus_device+0x16/0x30
>>>>> [  330.508919]  [<ffffffff8138894b>] remove_callback+0x2b/0x40
>>>>> [  330.575487]  [<ffffffff8125a82a>] sysfs_schedule_callback_work+0x1a/0x80
>>>>> [  330.655551]  [<ffffffff81091b81>] process_one_work+0x241/0x5f0
>>>>> [  330.725228]  [<ffffffff81091b0f>] ? process_one_work+0x1cf/0x5f0
>>>>> [  330.796981]  [<ffffffff8125a810>] ? sysfs_schedule_callback+0x210/0x210
>>>>> [  330.876002]  [<ffffffff81093d3b>] worker_thread+0x12b/0x3f0
>>>>> [  330.942567]  [<ffffffff81093c10>] ? manage_workers+0x180/0x180
>>>>> [  331.012243]  [<ffffffff81099f9e>] kthread+0xee/0x100
>>>>> [  331.071546]  [<ffffffff810e1839>] ? __lock_release+0x129/0x190
>>>>> [  331.141223]  [<ffffffff81099eb0>] ? __init_kthread_worker+0x70/0x70
>>>>> [  331.216099]  [<ffffffff816b8aec>] ret_from_fork+0x7c/0xb0
>>>>> [  331.280585]  [<ffffffff81099eb0>] ? __init_kthread_worker+0x70/0x70
>>>>> [  331.355453] Code: 89 65 f0 4c 89 6d f8 66 66 66 66 90 31 c0 49 89 fc 48 c7 c7 35 ee a3 81 e8 70 83 31 00 49 8b 44 24 10 48 8b 58 38 48 85 db 74 48 <80> 7b 4a 00 74 42 48 83 bb 88 00 00 00 00 74 38 31 c0 48 c7 c7
>>>>> [  331.587982] RIP  [<ffffffff813928f8>] pcie_aspm_exit_link_state+0x38/0x190
>>>>> [  331.670227]  RSP <ffff8807bde17c48>
>>>>> [  331.711935] ---[ end trace 359d14e0593f23af ]---
>>>>> [  331.767128] Kernel panic - not syncing: Fatal exception
>>>>> [  331.829701] ------------[ cut here ]------------
>>>>> [  331.884839] WARNING: at arch/x86/kernel/smp.c:123 native_smp_send_reschedule+0x5c/0x60()
>>>>> [  331.981506] Hardware name: PRIMEQUEST 1800E
>>>>> [  332.031449] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr lpc_ich mfd_core sg i2c_i801 i2c_core ioatdma i7core_edac edac_core e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
>>>>> [  332.631448] Pid: 6, comm: kworker/u:0 Tainted: G      D W    3.8.0-rc6-aspm-pcie-fix+ #58
>>>>> [  332.729156] Call Trace:
>>>>> [  332.758334]  <IRQ>  [<ffffffff8106dd5f>] warn_slowpath_common+0x7f/0xc0
>>>>> [  332.837472]  [<ffffffff8106ddba>] warn_slowpath_null+0x1a/0x20
>>>>> [  332.907144]  [<ffffffff8103db0c>] native_smp_send_reschedule+0x5c/0x60
>>>>> [  332.985129]  [<ffffffff810bc027>] trigger_load_balance+0x357/0x4f0
>>>>> [  333.058957]  [<ffffffff810aab76>] scheduler_tick+0x116/0x150
>>>>> [  333.126557]  [<ffffffff8108093e>] update_process_times+0x6e/0x90
>>>>> [  333.198305]  [<ffffffff810d8359>] tick_sched_handle+0x39/0x80
>>>>> [  333.266939]  [<ffffffff810d8584>] tick_sched_timer+0x54/0x90
>>>>> [  333.334541]  [<ffffffff8109f613>] __run_hrtimer+0x83/0x320
>>>>> [  333.400060]  [<ffffffff810d8530>] ? tick_nohz_handler+0xc0/0xc0
>>>>> [  333.470773]  [<ffffffff8109fb56>] hrtimer_interrupt+0x106/0x280
>>>>> [  333.541489]  [<ffffffff810b3fe7>] ? irqtime_account_irq+0xe7/0x100
>>>>> [  333.615316]  [<ffffffff816ba949>] smp_apic_timer_interrupt+0x69/0x99
>>>>> [  333.691221]  [<ffffffff816b9872>] apic_timer_interrupt+0x72/0x80
>>>>> [  333.762968]  <EOI>  [<ffffffff816aab60>] ? panic+0x1a6/0x1ee
>>>>> [  333.830680]  [<ffffffff816aab5c>] ? panic+0x1a2/0x1ee
>>>>> [  333.891012]  [<ffffffff81071ca8>] ? kmsg_dump+0x1d8/0x2a0
>>>>> [  333.955492]  [<ffffffff81071af6>] ? kmsg_dump+0x26/0x2a0
>>>>> [  334.018937]  [<ffffffff81071c90>] ? kmsg_dump+0x1c0/0x2a0
>>>>> [  334.083424]  [<ffffffff816b022c>] oops_end+0xdc/0xf0
>>>>> [  334.142717]  [<ffffffff8101aa8b>] die+0x5b/0x90
>>>>> [  334.196816]  [<ffffffff816afe0c>] do_general_protection+0xdc/0x160
>>>>> [  334.270643]  [<ffffffff816af2a3>] ? restore_args+0x30/0x30
>>>>> [  334.336165]  [<ffffffff816af518>] general_protection+0x28/0x30
>>>>> [  334.405839]  [<ffffffff813928f8>] ? pcie_aspm_exit_link_state+0x38/0x190
>>>>> [  334.485897]  [<ffffffff813928ea>] ? pcie_aspm_exit_link_state+0x2a/0x190
>>>>> [  334.565955]  [<ffffffff81380f42>] pci_stop_dev+0xb2/0xd0
>>>>> [  334.629398]  [<ffffffff81380fb4>] pci_stop_bus_device+0x54/0x60
>>>>> [  334.700114]  [<ffffffff81381156>] pci_stop_and_remove_bus_device+0x16/0x30
>>>>> [  334.782248]  [<ffffffff8138894b>] remove_callback+0x2b/0x40
>>>>> [  334.848807]  [<ffffffff8125a82a>] sysfs_schedule_callback_work+0x1a/0x80
>>>>> [  334.928863]  [<ffffffff81091b81>] process_one_work+0x241/0x5f0
>>>>> [  334.998539]  [<ffffffff81091b0f>] ? process_one_work+0x1cf/0x5f0
>>>>> [  335.070290]  [<ffffffff8125a810>] ? sysfs_schedule_callback+0x210/0x210
>>>>> [  335.149311]  [<ffffffff81093d3b>] worker_thread+0x12b/0x3f0
>>>>> [  335.215870]  [<ffffffff81093c10>] ? manage_workers+0x180/0x180
>>>>> [  335.285544]  [<ffffffff81099f9e>] kthread+0xee/0x100
>>>>> [  335.344837]  [<ffffffff810e1839>] ? __lock_release+0x129/0x190
>>>>> [  335.414511]  [<ffffffff81099eb0>] ? __init_kthread_worker+0x70/0x70
>>>>> [  335.489379]  [<ffffffff816b8aec>] ret_from_fork+0x7c/0xb0
>>>>> [  335.553860]  [<ffffffff81099eb0>] ? __init_kthread_worker+0x70/0x70
>>>>> [  335.628727] ---[ end trace 359d14e0593f23b0 ]---
>>
>> Please create a bugzilla for this issue.
>>
>
> Here:https://bugzilla.kernel.org/show_bug.cgi?id=54411
>
>> I think this is a general object lifetime issue that really has
>> nothing to do with ASPM except that ASPM happens to be the victim.
>>
>> You're doing this:
>>
>>     echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1
>>>  /sys/bus/pci/devices/0000\:1a\:01.0/remove
>>
>> The 1a:01.0 device is downstream from the 10:00.0 bridge.  The sysfs
>> interface remove_store() uses device_schedule_callback() to schedule
>> the remove for later.  I think what's happening is that we schedule
>> remove_callback() for both devices before 10:00.0 has been removed,
>> like this:
>>
>>     # echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove
>>     remove_store  # for 10:00.0
>>       device_schedule_callback(10:00.0, remove_callback)
>>         sysfs_schedule_callback
>>           kobject_get
>>           queue_work
>>     # echo -n 1 >  /sys/bus/pci/devices/0000\:1a\:01.0/remove
>>     remove_store  # for 1a:01.0
>>       device_schedule_callback(1a:01.0, remove_callback)
>>         sysfs_schedule_callback
>>           kobject_get
>>           queue_work
>>
>> Note that we acquire a reference on each pci_dev before queuing the work item.
>>
>> Later, we run the callbacks, starting with 10:00.0.  This calls
>> remove_callback() to perform the remove:
>>
>>     remove_callback(10:00.0)
>>       mutex_lock(&pci_remove_rescan_mutex)
>>       pci_stop_and_remove_bus_device(pdev)
>>       mutex_unlock(&pci_remove_rescan_mutex)
>>
>> This will stop and remove the subtree below 10:00.0, but it does not
>> actually free the pci_dev for 1a:01.0 because we increased its ref
>> count in sysfs_schedule_callback.  So after completing
>> remove_callback(10:00.0), we run the second callback for 1a:01.0.
>>
>> The remove for 1a:01.0 calls pcie_aspm_exit_link_state() from
>> pci_stop_dev().  This is where we blow up because, according to your
>> debugging, pdev->bus->self is no longer valid.
>>
>> The PCI core did this removal wrong.  If we have a valid pci_dev
>> pointer, as we do in pcie_aspm_exit_link_state(), the whole object
>> ought to be valid.  But the PCI core deallocated the struct pci_bus
>> for bus 0000:1a too soon.
>
> Your analysis is perfect, and it solves my doubt. Thanks very much!:)

Gu:

I can't tell from your response whether you are stating that you agree
with Bjorn's analysis -or- that you applied Yinghai's patch above
(dated Feb 23) and found that the testing scenario was now successful.
 Could you please specifically state which and if it was just the
former would you be able to test Yinghai's implementation of Bjorn's
analysis?

Thanks,
 Myron
>
>>
>> My guess is that when we build a pci_dev, we need to increase the ref
>> count on the pci_bus where that pci_dev lives.  That way we can keep
>> around all the buses and bridges leading from the root to the device
>> in question.
>>
>> Bjorn
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux