On 05/03/2015 13:03, Catalin Marinas wrote:
>> >
>> > I'd hate to have to do that. PCI should be entirely probeable
>> > given that we tell the guest where the host bridge is, that's
>> > one of its advantages.
> I didn't say a DT node per device, the DT doesn't know what PCI devices
> are available (otherwise it defeats the idea of probing). But we need to
> tell the OS where the host bridge is via DT.
>
> So the guest would be told about two host bridges: one for real devices
> and another for virtual devices. These can have different coherency
> properties.

Yeah, and it would suck that the user needs to know the difference
between the coherency properties of the host bridges.

It would especially suck if the user has a cluster with different
machines, some of them coherent and others non-coherent, and then has
to debug why the same configuration works on some machines and not on
others.

To avoid replying in two different places, which of the solutions looks
to me like something that half-works? Pretty much all of them, because
in the end it is just a processor misfeature. For example, Intel
virtualization extensions let the hypervisor override stage1 translation
_if necessary_. AMD doesn't, but has some other quirky things that let
you achieve the same effect.

In particular, I am not even sure that this is about bus coherency,
because this problem does not happen when the device is doing bus master
DMA. Working around coherency for bus master DMA would be easy.

The problem arises with MMIO areas that the guest can reasonably expect
to be uncacheable, but that are optimized by the host so that they end
up backed by cacheable RAM. It's perfectly reasonable that the same
device needs cacheable mapping with one userspace, and works with
uncacheable mapping with another userspace that doesn't optimize the
MMIO area to RAM.

Currently the VGA framebuffer is the main case where this happens, and
I don't expect many more.
Because this is not bus master DMA, it's hard to find a QEMU API that
can be hooked to invalidate the cache. QEMU is just reading from an
array of chars.

In practice, the VGA framebuffer has an optimization that uses dirty
page tracking, so we could piggyback on the ioctls that return which
pages are dirty. It turns out that piggybacking on those ioctls should
also fix the case of migrating a guest while the MMU is disabled.

But it's a hack, and it may not work for other devices.

We could use _DSD to export the device tree property separately for
each device, but that wouldn't work for hotplugged devices.

Paolo
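
For illustration only, the dirty-log piggybacking idea above could look
roughly like the sketch below. This is not QEMU code. It assumes a
bitmap with one bit per guest page (bit i set means page i is dirty),
of the kind a dirty-log ioctl such as KVM_GET_DIRTY_LOG hands back, and
the names for_each_dirty_page(), dirty_page_fn and record_last_dirty()
are hypothetical. A real callback would invalidate the host's cached
view of the page before the framebuffer contents are read.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical hook, called once for every dirty page found. */
typedef void (*dirty_page_fn)(uint64_t page_index, void *opaque);

/*
 * Walk a dirty bitmap (assumed layout: bit i of the unsigned long
 * array describes page i) and call fn for each dirty page.
 * Returns the number of dirty pages seen. fn may be NULL.
 */
size_t for_each_dirty_page(const unsigned long *bitmap, size_t npages,
                           dirty_page_fn fn, void *opaque)
{
    size_t count = 0;

    assert(bitmap != NULL);

    for (size_t i = 0; i < npages; i++) {
        unsigned long word = bitmap[i / BITS_PER_WORD];

        if (word & (1UL << (i % BITS_PER_WORD))) {
            if (fn) {
                fn(i, opaque);
            }
            count++;
        }
    }
    return count;
}

/*
 * Placeholder callback for illustration: it only records the index of
 * the last dirty page visited. A real hook would do the cache
 * maintenance for that page instead.
 */
void record_last_dirty(uint64_t page_index, void *opaque)
{
    *(uint64_t *)opaque = page_index;
}
```

The point of the callback shape is that the cache maintenance stays
tied to the pages the dirty log already reports, which is also why it
remains a hack: a device without dirty tracking gets nothing from it.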