On 9/23/20 2:33 AM, Qian Cai wrote: > On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote: >> When a passthrough IO adapter is removed from a pseries machine using >> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the >> guest OS to clear all page table entries related to the adapter. If >> some are still present, the RTAS call which isolates the PCI slot >> returns error 9001 "valid outstanding translations" and the removal of >> the IO adapter fails. This is because when the PHBs are scanned, Linux >> maps automatically the INTx interrupts in the Linux interrupt number >> space but these are never removed. >> >> To solve this problem, we introduce a PPC platform specific >> pcibios_remove_bus() routine which clears all interrupt mappings when >> the bus is removed. This also clears the associated page table entries >> of the ESB pages when using XIVE. >> >> For this purpose, we record the logical interrupt numbers of the >> mapped interrupt under the PHB structure and let pcibios_remove_bus() >> do the clean up. >> >> Since some PCI adapters, like GPUs, use the "interrupt-map" property >> to describe interrupt mappings other than the legacy INTx interrupts, >> we can not restrict the size of the mapping array to PCI_NUM_INTX. The >> number of interrupt mappings is computed from the "interrupt-map" >> property and the mapping array is allocated accordingly. >> >> Cc: "Oliver O'Halloran" <oohall@xxxxxxxxx> >> Cc: Alexey Kardashevskiy <aik@xxxxxxxxx> >> Signed-off-by: Cédric Le Goater <clg@xxxxxxxx> > > Some syscall fuzzing will trigger this on POWER9 NV where the traces pointed to > this patch. > > .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config OK. The patch is missing a NULL assignement after kfree() and that might be the issue. I did try PHB removal under PowerNV, so I would like to understand how we managed to remove twice the PCI bus and possibly reproduce. Any chance we could grab what the syscall fuzzer (syzkaller) did ? Thanks, C. > > [ 3574.564109][ T965] ata1.00: disabled > [ 3574.580373][T151472] sd 0:0:0:0: [sdb] Synchronizing SCSI cache > [ 3574.581180][T151472] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00 > [ 3574.581226][T151472] sd 0:0:0:0: [sdb] Stopping disk > [ 3574.581289][T151472] sd 0:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00 > [ 3574.611424][ T3019] Read-error on swap-device (254:1:849792) > [ 3574.611685][ T3019] Read-error on swap-device (254:1:914944) > [ 3574.611769][ T3019] Read-error on swap-device (254:1:915072) > [ 3574.611838][ T3019] Read-error on swap-device (254:1:915200) > [ 3574.611926][ T3019] Read-error on swap-device (254:1:915328) > [ 3574.612268][ T3084] Read-error on swap-device (254:1:792576) > [ 3574.612342][ T3084] Read-error on swap-device (254:1:792704) > [ 3574.612757][ T2362] Read-error on swap-device (254:1:957440) > [ 3574.612773][ T2905] Read-error on swap-device (254:1:784128) > [ 3574.613015][ T2362] Read-error on swap-device (254:1:957568) > [ 3574.613160][ T2905] Read-error on swap-device (254:1:784256) > [ 3574.613241][ T2362] Read-error on swap-device (254:1:957696) > [ 3574.613342][ T2362] Read-error on swap-device (254:1:957824) > [ 3574.614448][ T3019] Core dump to |/usr/lib/systemd/systemd-coredump pipe failed > [ 3574.614663][ T3019] Read-error on swap-device (254:1:961536) > [ 3574.675330][T151844] Read-error on swap-device (254:1:128) > [ 3574.675515][T151844] Read-error on swap-device (254:1:256) > [ 3574.675700][T151844] Read-error on swap-device (254:1:384) > [ 3574.703570][ T971] ata2.00: disabled > [ 3574.710393][T151472] sd 1:0:0:0: [sda] Synchronizing SCSI cache > [ 3574.710864][T151472] sd 1:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00 > [ 3574.710922][T151472] sd 1:0:0:0: [sda] Stopping disk > [ 3574.711010][T151472] sd 1:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00 > [ 3574.826569][ T674] dm-0: writeback error on inode 68507862, offset 65536, sector 54281504 > [ 3575.117547][ T3366] dm-0: writeback error on inode 68507851, offset 0, sector 54378880 > [ 3575.140104][T151472] pci 0004:03:00.0: Removing from iommu group 3 > [ 3575.141778][T151472] pci 0004:03 : [PE# fb] Releasing PE > [ 3575.141965][T151472] pci 0004:03 : [PE# fb] Removing DMA window #0 > [ 3575.142452][T151472] pci 0004:03 : [PE# fb] Disabling 64-bit DMA bypass > [ 3575.149369][T151472] pci_bus 0004:03: busn_res: [bus 03] is released > [ 3575.150574][T152037] Read-error on swap-device (254:1:35584) > [ 3575.150713][T152037] Read-error on swap-device (254:1:35712) > [ 3575.152632][T152037] Read-error on swap-device (254:1:915584) > [ 3575.152706][T151472] pci_bus 0004:04: busn_res: [bus 04-08] is released > [ 3575.152983][T151472] ============================================================================= > [ 3575.153937][T151472] BUG kmalloc-16 (Not tainted): Object already free > [ 3575.153962][T151472] ----------------------------------------------------------------------------- > [ 3575.153962][T151472] > [ 3575.154020][T151472] Disabling lock debugging due to kernel taint > [ 3575.154047][T151472] INFO: Allocated in pcibios_scan_phb+0x104/0x3e0 age=356904 cpu=4 pid=1 > [ 3575.154084][T151472] __slab_alloc+0xa4/0xf0 > [ 3575.154105][T151472] __kmalloc+0x294/0x330 > [ 3575.154127][T151472] pcibios_scan_phb+0x104/0x3e0 > [ 3575.154165][T151472] pcibios_init+0x84/0x124 > [ 3575.154209][T151472] do_one_initcall+0xac/0x528 > [ 3575.154241][T151472] kernel_init_freeable+0x35c/0x3fc > [ 3575.154272][T151472] kernel_init+0x24/0x148 > [ 3575.154306][T151472] ret_from_kernel_thread+0x5c/0x80 > [ 3575.154352][T151472] INFO: Freed in pcibios_remove_bus+0x70/0x90 age=0 cpu=7 pid=151472 > [ 3575.154387][T151472] kfree+0x49c/0x510 > [ 3575.154406][T151472] pcibios_remove_bus+0x70/0x90 > [ 3575.154443][T151472] pci_remove_bus+0xe4/0x110 > [ 3575.154467][T151472] pci_remove_bus_device+0x74/0x170 > [ 3575.154503][T151472] pci_remove_bus_device+0x4c/0x170 > [ 3575.154524][T151472] pci_stop_and_remove_bus_device_locked+0x34/0x50 > [ 3575.154539][T151472] remove_store+0xc0/0xe0 > [ 3575.154551][T151472] dev_attr_store+0x30/0x50 > [ 3575.154573][T151472] sysfs_kf_write+0x68/0xb0 > [ 3575.154595][T151472] kernfs_fop_write+0x114/0x260 > [ 3575.154643][T151472] vfs_write+0xe4/0x260 > [ 3575.154667][T151472] ksys_write+0x74/0x130 > [ 3575.154692][T151472] system_call_exception+0xf8/0x1d0 > [ 3575.154728][T151472] system_call_common+0xe8/0x218 > [ 3575.154788][T151472] INFO: Slab 0x00000000cafdf25c objects=178 used=174 fp=0x0000000020a64b99 flags=0x7fff8000000201 > [ 3575.154848][T151472] INFO: Object 0x00000000af116201 @offset=5168 fp=0x0000000000000000 > [ 3575.154848][T151472] > [ 3575.154901][T151472] Redzone 00000000781d3b96: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ > [ 3575.154968][T151472] Object 00000000af116201: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. > [ 3575.155012][T151472] Redzone 000000007b8ec00f: bb bb bb bb bb bb bb bb ........ > [ 3575.155052][T151472] Padding 00000000df7d5e89: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ > [ 3575.155107][T151472] CPU: 7 PID: 151472 Comm: trinity-c61 Tainted: G B 5.9.0-rc6-next-20200922+ #2 > [ 3575.155147][T151472] Call Trace: > [ 3575.155171][T151472] [c000001fefdef800] [c000000000644278] dump_stack+0xec/0x144 (unreliable) > [ 3575.155204][T151472] [c000001fefdef840] [c00000000035e9a8] print_trailer+0x278/0x2a0 > [ 3575.155220][T151472] [c000001fefdef8d0] [c000000000355d9c] free_debug_processing+0x57c/0x600 > [ 3575.155236][T151472] [c000001fefdef9b0] [c000000000356234] __slab_free+0x414/0x5b0 > [ 3575.155251][T151472] [c000001fefdefa80] [c00000000035686c] kfree+0x49c/0x510 > [ 3575.155267][T151472] [c000001fefdefb10] [c000000000043260] pcibios_remove_bus+0x70/0x90 > pci_irq_map_dispose at arch/powerpc/kernel/pci-common.c:456 > (inlined by) pcibios_remove_bus at arch/powerpc/kernel/pci-common.c:461 > [ 3575.155282][T151472] [c000001fefdefb40] [c00000000066fee4] pci_remove_bus+0xe4/0x110 > [ 3575.155309][T151472] [c000001fefdefb70] [c000000000670084] pci_remove_bus_device+0x74/0x170 > [ 3575.155348][T151472] [c000001fefdefbb0] [c000000000670070] pci_remove_bus_device+0x60/0x170 > [ 3575.155377][T151472] [c000001fefdefbf0] [c0000000006701f4] pci_stop_and_remove_bus_device_locked+0x34/0x50 > [ 3575.155395][T151472] [c000001fefdefc20] [c00000000067f5e0] remove_store+0xc0/0xe0 > [ 3575.155431][T151472] [c000001fefdefc70] [c0000000006dccd0] dev_attr_store+0x30/0x50 > [ 3575.155457][T151472] [c000001fefdefc90] [c00000000049fae8] sysfs_kf_write+0x68/0xb0 > [ 3575.155507][T151472] [c000001fefdefcd0] [c00000000049ed14] kernfs_fop_write+0x114/0x260 > [ 3575.155553][T151472] [c000001fefdefd20] [c0000000003ab2d4] vfs_write+0xe4/0x260 > [ 3575.155592][T151472] [c000001fefdefd70] [c0000000003ab604] ksys_write+0x74/0x130 > [ 3575.155630][T151472] [c000001fefdefdc0] [c00000000002a458] system_call_exception+0xf8/0x1d0 > [ 3575.155671][T151472] [c000001fefdefe20] [c00000000000d0a8] system_call_common+0xe8/0x218 > [ 3575.155715][T151472] FIX kmalloc-16: Object at 0x00000000af116201 not freed > [ 3575.156125][T151472] ============================================================================= > [ 3575.156170][T151472] BUG kmalloc-16 (Tainted: G B ): Wrong object count. Counter is 174 but counted were 176 > [ 3575.156204][T151472] ----------------------------------------------------------------------------- > [ 3575.156204][T151472] > [ 3575.156240][T151472] INFO: Slab 0x00000000cafdf25c objects=178 used=174 fp=0x0000000020a64b99 flags=0x7fff8000000201 > [ 3575.156296][T151472] CPU: 7 PID: 151472 Comm: trinity-c61 Tainted: G B 5.9.0-rc6-next-20200922+ #2 > [ 3575.156348][T151472] Call Trace: > [ 3575.156376][T151472] [c000001fefdef6f0] [c000000000644278] dump_stack+0xec/0x144 (unreliable) > [ 3575.156439][T151472] [c000001fefdef730] [c00000000035e688] slab_err+0x78/0xb0 > [ 3575.156502][T151472] [c000001fefdef810] [c0000000003552a4] on_freelist+0x364/0x390 > [ 3575.156541][T151472] [c000001fefdef8b0] [c000000000355aa8] free_debug_processing+0x288/0x600 > [ 3575.156598][T151472] [c000001fefdef990] [c000000000356234] __slab_free+0x414/0x5b0 > [ 3575.156644][T151472] [c000001fefdefa60] [c00000000035686c] kfree+0x49c/0x510 > [ 3575.156701][T151472] [c000001fefdefaf0] [c0000000002b8a10] kfree_const+0x60/0x80 > [ 3575.156738][T151472] [c000001fefdefb10] [c00000000064d3bc] kobject_release+0x7c/0xd0 > [ 3575.156784][T151472] [c000001fefdefb50] [c0000000006de0c0] put_device+0x20/0x40 > [ 3575.156831][T151472] [c000001fefdefb70] [c00000000067015c] pci_remove_bus_device+0x14c/0x170 > [ 3575.156867][T151472] [c000001fefdefbb0] [c000000000670070] pci_remove_bus_device+0x60/0x170 > [ 3575.156923][T151472] [c000001fefdefbf0] [c0000000006701f4] pci_stop_and_remove_bus_device_locked+0x34/0x50 > [ 3575.156998][T151472] [c000001fefdefc20] [c00000000067f5e0] remove_store+0xc0/0xe0 > [ 3575.157058][T151472] [c000001fefdefc70] [c0000000006dccd0] dev_attr_store+0x30/0x50 > [ 3575.157114][T151472] [c000001fefdefc90] [c00000000049fae8] sysfs_kf_write+0x68/0xb0 > [ 3575.157169][T151472] [c000001fefdefcd0] [c00000000049ed14] kernfs_fop_write+0x114/0x260 > [ 3575.157215][T151472] [c000001fefdefd20] [c0000000003ab2d4] vfs_write+0xe4/0x260 > [ 3575.157256][T151472] [c000001fefdefd70] [c0000000003ab604] ksys_write+0x74/0x130 > [ 3575.157301][T151472] [c000001fefdefdc0] [c00000000002a458] system_call_exception+0xf8/0x1d0 > [ 3575.157352][T151472] [c000001fefdefe20] [c00000000000d0a8] system_call_common+0xe8/0x218 > [ 3575.157396][T151472] FIX kmalloc-16: Object count adjusted. > [ 3575.157457][T151472] pci_bus 0004:09: busn_res: [bus 09-0d] is released > > >> --- >> >> Changes since v2: >> >> - merged 2 patches. >> >> arch/powerpc/include/asm/pci-bridge.h | 6 ++ >> arch/powerpc/kernel/pci-common.c | 114 ++++++++++++++++++++++++++ >> 2 files changed, 120 insertions(+) >> >> diff --git a/arch/powerpc/include/asm/pci-bridge.h >> b/arch/powerpc/include/asm/pci-bridge.h >> index b92e81b256e5..ca75cf264ddf 100644 >> --- a/arch/powerpc/include/asm/pci-bridge.h >> +++ b/arch/powerpc/include/asm/pci-bridge.h >> @@ -48,6 +48,9 @@ struct pci_controller_ops { >> >> /* >> * Structure of a PCI controller (host bridge) >> + * >> + * @irq_count: number of interrupt mappings >> + * @irq_map: interrupt mappings >> */ >> struct pci_controller { >> struct pci_bus *bus; >> @@ -127,6 +130,9 @@ struct pci_controller { >> >> void *private_data; >> struct npu *npu; >> + >> + unsigned int irq_count; >> + unsigned int *irq_map; >> }; >> >> /* These are used for config access before all the PCI probing >> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci- >> common.c >> index be108616a721..deb831f0ae13 100644 >> --- a/arch/powerpc/kernel/pci-common.c >> +++ b/arch/powerpc/kernel/pci-common.c >> @@ -353,6 +353,115 @@ struct pci_controller >> *pci_find_controller_for_domain(int domain_nr) >> return NULL; >> } >> >> +/* >> + * Assumption is made on the interrupt parent. All interrupt-map >> + * entries are considered to have the same parent. >> + */ >> +static int pcibios_irq_map_count(struct pci_controller *phb) >> +{ >> + const __be32 *imap; >> + int imaplen; >> + struct device_node *parent; >> + u32 intsize, addrsize, parintsize, paraddrsize; >> + >> + if (of_property_read_u32(phb->dn, "#interrupt-cells", &intsize)) >> + return 0; >> + if (of_property_read_u32(phb->dn, "#address-cells", &addrsize)) >> + return 0; >> + >> + imap = of_get_property(phb->dn, "interrupt-map", &imaplen); >> + if (!imap) { >> + pr_debug("%pOF : no interrupt-map\n", phb->dn); >> + return 0; >> + } >> + imaplen /= sizeof(u32); >> + pr_debug("%pOF : imaplen=%d\n", phb->dn, imaplen); >> + >> + if (imaplen < (addrsize + intsize + 1)) >> + return 0; >> + >> + imap += intsize + addrsize; >> + parent = of_find_node_by_phandle(be32_to_cpup(imap)); >> + if (!parent) { >> + pr_debug("%pOF : no imap parent found !\n", phb->dn); >> + return 0; >> + } >> + >> + if (of_property_read_u32(parent, "#interrupt-cells", &parintsize)) { >> + pr_debug("%pOF : parent lacks #interrupt-cells!\n", phb->dn); >> + return 0; >> + } >> + >> + if (of_property_read_u32(parent, "#address-cells", ¶ddrsize)) >> + paraddrsize = 0; >> + >> + return imaplen / (addrsize + intsize + 1 + paraddrsize + parintsize); >> +} >> + >> +static void pcibios_irq_map_init(struct pci_controller *phb) >> +{ >> + phb->irq_count = pcibios_irq_map_count(phb); >> + if (phb->irq_count < PCI_NUM_INTX) >> + phb->irq_count = PCI_NUM_INTX; >> + >> + pr_debug("%pOF : interrupt map #%d\n", phb->dn, phb->irq_count); >> + >> + phb->irq_map = kcalloc(phb->irq_count, sizeof(unsigned int), >> + GFP_KERNEL); >> +} >> + >> +static void pci_irq_map_register(struct pci_dev *pdev, unsigned int virq) >> +{ >> + struct pci_controller *phb = pci_bus_to_host(pdev->bus); >> + int i; >> + >> + if (!phb->irq_map) >> + return; >> + >> + for (i = 0; i < phb->irq_count; i++) { >> + /* >> + * Look for an empty or an equivalent slot, as INTx >> + * interrupts can be shared between adapters. >> + */ >> + if (phb->irq_map[i] == virq || !phb->irq_map[i]) { >> + phb->irq_map[i] = virq; >> + break; >> + } >> + } >> + >> + if (i == phb->irq_count) >> + pr_err("PCI:%s all platform interrupts mapped\n", >> + pci_name(pdev)); >> +} >> + >> +/* >> + * Clearing the mapped interrupts will also clear the underlying >> + * mappings of the ESB pages of the interrupts when under XIVE. It is >> + * a requirement of PowerVM to clear all memory mappings before >> + * removing a PHB. >> + */ >> +static void pci_irq_map_dispose(struct pci_bus *bus) >> +{ >> + struct pci_controller *phb = pci_bus_to_host(bus); >> + int i; >> + >> + if (!phb->irq_map) >> + return; >> + >> + pr_debug("PCI: Clearing interrupt mappings for PHB %04x:%02x...\n", >> + pci_domain_nr(bus), bus->number); >> + for (i = 0; i < phb->irq_count; i++) >> + irq_dispose_mapping(phb->irq_map[i]); >> + >> + kfree(phb->irq_map); >> +} >> + >> +void pcibios_remove_bus(struct pci_bus *bus) >> +{ >> + pci_irq_map_dispose(bus); >> +} >> +EXPORT_SYMBOL_GPL(pcibios_remove_bus); >> + >> /* >> * Reads the interrupt pin to determine if interrupt is use by card. >> * If the interrupt is used, then gets the interrupt line from the >> @@ -401,6 +510,8 @@ static int pci_read_irq_line(struct pci_dev *pci_dev) >> >> pci_dev->irq = virq; >> >> + /* Record all interrut mappings for later removal of a PHB */ >> + pci_irq_map_register(pci_dev, virq); >> return 0; >> } >> >> @@ -1554,6 +1665,9 @@ void pcibios_scan_phb(struct pci_controller *hose) >> >> pr_debug("PCI: Scanning PHB %pOF\n", node); >> >> + /* Allocate interrupt mappings array */ >> + pcibios_irq_map_init(hose); >> + >> /* Get some IO space for the new PHB */ >> pcibios_setup_phb_io_space(hose); >> >