[+cc linux-mm, linux-kernel]

For context, the start of this discussion was here:

  http://lkml.kernel.org/r/1424157203-691-8-git-send-email-gwshan@xxxxxxxxxxxxxxxxxx

where Gavin is adding a new PCI hotplug driver for PowerNV.  That new
driver calls vm_unmap_aliases() the same way we do in the existing RPA
hotplug driver here:

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/hotplug/rpaphp_core.c#n432

I'm trying to figure out whether it's correct to use
vm_unmap_aliases() here, but I'm not an mm person so all I have is my
gut feeling that something doesn't smell right.

On Tue, Feb 17, 2015 at 6:30 PM, Benjamin Herrenschmidt
<benh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, 2015-02-18 at 11:16 +1100, Gavin Shan wrote:
>> >What is vm_unmap_aliases() for?  I see this is probably copied from
>> >rpaphp_core.c, where it was added by b4a26be9f6f8 ("powerpc/pseries:
>> >Flush lazy kernel mappings after unplug operations").
>> >
>> >But I don't know whether:
>> >
>> >  - this is something specific to powerpc,
>> >  - the lack of vm_unmap_aliases() in other hotplug paths is a bug,
>> >  - the fact that we only do this on powerpc is covering up a
>> >    powerpc bug somewhere
>>
>> Yes, I copied this piece of code from rpaphp_core.c. I think Ben might
>> help to answer the questions as he added the patch. I had very quick
>> check on mm/vmalloc.c and it's reasonable to have vm_unmap_aliases()
>> here to flush TLB entries for ioremap() regions, which were unmapped
>> previously. if I'm correct. I don't think it's powerpc specific.
>
> It's specific to running under the PowerVM hypervisor, and thus doesn't
> affect PowerNV, just don't copy it over.
>
> It comes from the fact that the generic ioremap code nowadays delays
> TLB flushing on unmap.
> The TLB flushing code is what, on powerpc,
> ensures that we remove the translations from the MMU hash table (the
> hash table is essentially treated as an extended in-memory TLB), which
> on pseries turns into hypervisor calls.
>
> When running under that hypervisor, the HV ensures that no translation
> still exists in the hash before allowing a device to be removed from
> a partition. If translations still exist, the removal fails.
>
> So we need to force the generic ioremap code to perform all the TLB
> flushes for iounmap'ed regions before we "complete" the unplug operation
> from a kernel perspective so that the device can be re-assigned to
> another partition.
>
> This is thus useless on platforms like powernv which do not run under
> such a hypervisor.

So the hypervisor call that removes the device from the partition will
fail if there are any translations that reference the memory of the
device.

Let me go through this in excruciating detail to see if I understand
what's going on:

  - PCI core enumerates device D1
  - PCI core sets device D1 BAR 0 = 0x1000
  - driver claims D1
  - driver ioremaps 0x1000 at virtual address V
  - translation V -> 0x1000 is in TLB
  - driver iounmaps V (but V -> 0x1000 translation may remain in TLB)
  - driver releases D1
  - hot-remove D1 (without vm_unmap_aliases(), hypervisor would fail this)
  - it would be a bug to reference V here, but if we did, the
    virt-to-phys translation would succeed and we'd have a Master
    Abort or Unsupported Request on PCI/PCIe
  - hot-add D2
  - PCI core enumerates device D2
  - PCI core sets device D2 BAR 0 = 0x1000
  - it would be a bug to reference V here (before ioremapping), but if
    we did, the reference would reach D2

I don't see anything hypervisor-specific here except for the fact that
the hypervisor checks for existing translations and most other
platforms don't.  But it seems like the unexpected PCI aborts could
happen on any platform.
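
To make the sequence above concrete, here is a toy user-space model of
it.  This is only a sketch under my own assumptions: every toy_* name
is made up for illustration, nothing here is kernel code, and the
"TLB" is just a small array.  It models a lazy iounmap that leaves the
translation cached, a flush that stands in for vm_unmap_aliases(), and
a hypervisor check that refuses the removal while a translation into
the BAR still exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these toy_* names exist in the kernel. */
#define TOY_TLB_SLOTS 4

struct toy_xlate {
    unsigned long virt;
    unsigned long phys;
    bool cached;        /* translation still present in the "TLB" */
    bool unmap_pending; /* driver called iounmap, flush not done yet */
};

static struct toy_xlate toy_tlb[TOY_TLB_SLOTS];

/* Drop every translation, as if nothing had been mapped yet. */
static void toy_reset(void)
{
    for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
        toy_tlb[i] = (struct toy_xlate){0};
}

/* Model ioremap(): install virt -> phys in the first free slot. */
static void toy_ioremap(unsigned long virt, unsigned long phys)
{
    for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
        if (!toy_tlb[i].cached) {
            toy_tlb[i] = (struct toy_xlate){ virt, phys, true, false };
            return;
        }
    }
    assert(!"toy TLB full");
}

/* Model a lazy iounmap(): the translation stays cached, only marked. */
static void toy_iounmap(unsigned long virt)
{
    for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
        if (toy_tlb[i].cached && toy_tlb[i].virt == virt)
            toy_tlb[i].unmap_pending = true;
}

/* Stand-in for vm_unmap_aliases(): perform all deferred flushes now. */
static void toy_unmap_aliases(void)
{
    for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
        if (toy_tlb[i].unmap_pending)
            toy_tlb[i] = (struct toy_xlate){0};
}

/* Hypervisor check: fail while any cached translation hits the BAR. */
static bool toy_hv_remove(unsigned long bar, unsigned long size)
{
    for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
        if (toy_tlb[i].cached &&
            toy_tlb[i].phys >= bar && toy_tlb[i].phys - bar < size)
            return false;
    return true;
}

/* Would a (buggy) reference to virt still translate?  */
bool toy_translate(unsigned long virt, unsigned long *phys)
{
    for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
        if (toy_tlb[i].cached && toy_tlb[i].virt == virt) {
            *phys = toy_tlb[i].phys;
            return true;
        }
    }
    return false;
}

/* Walk the D1 steps; returns whether the hot-remove succeeded. */
bool toy_hotplug_scenario(bool flush_before_remove)
{
    const unsigned long bar0 = 0x1000, size = 0x1000, v = 0xd0000000UL;

    toy_reset();
    toy_ioremap(v, bar0);       /* driver ioremaps BAR 0 at V */
    toy_iounmap(v);             /* driver iounmaps V, flush deferred */
    if (flush_before_remove)
        toy_unmap_aliases();
    return toy_hv_remove(bar0, size);
}
```

In this model, toy_hotplug_scenario(false) reports a failed removal and
a later toy_translate() of V still resolves to 0x1000, which is the
stale reference I'm worried about.  With the flush, the removal
succeeds and V no longer translates.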

Are we saying that those PCI aborts are OK, since it's a bug to make
those references in the first place?  Or would we rather take a TLB
miss fault instead so the references never make it to PCI?

I would think there would be similar issues when unmapping and
re-mapping plain old physical memory.  But I don't see
vm_unmap_aliases() calls there, so those issues must be handled
differently.  Should we handle this PCI hotplug issue the same way we
handle RAM?

Bjorn