> > + David and Jon
> >
> > On Tue, 2017-04-25 at 18:34 +0800, Xiong Zhang wrote:
> >
> > The blocking issue I see is that bisecting is still not pointing at
> > relevant commits. Both bisected commits from Bugzilla are not related
> > to changes in stolen memory usage behavior. I'd assume a successful
> > bisect to land at the patches where we start creating kernel internal
> > objects from stolen memory. Otherwise we could be ignoring a bug
> > elsewhere. If it consistently lands on those patches, then there might
> > be something wrong with them, in addition to stolen memory problems.
> [Zhang, Xiong Y] I only tried kernels 4.8, 4.9 and above. As the Bugzilla
> report describes, a 4.8 guest kernel shows no GPU hang in the guest dmesg,
> while a 4.9 guest kernel does. From that point of view, we could do a git
> bisect.
> But tons of IOMMU DMA R/W exceptions on stolen memory appear in the host
> dmesg with both the 4.8 and the 4.9 guest kernel. This means the guest
> domain IOMMU table has no mapping for stolen memory, and the IGD fails to
> access stolen memory under both guest kernels. From that point of view,
> this issue isn't a regression and isn't a candidate for git bisect. You can
> check the host error messages in the Bugzilla attachment. This should be
> fixed first.
> Anyway, I will try my best to find the right commit through git bisect,
> but I'm afraid the result will be the same as before, because we don't
> have a stable good point from which to start the bisect.

[Zhang, Xiong Y] Hi Joonas,
As you said, the GPU hang exists because i915 creates the ring buffer from
stolen memory. I did git bisect again, and the following commit is the
first bad commit:

commit c58b735fc762e891481e92af7124b85cb0a51fce
Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Date:   Thu Aug 18 17:16:57 2016 +0100

    drm/i915: Allocate rings from stolen

    If we have stolen available, make use of it for ringbuffer allocation.
    Previously this was restricted to !llc platforms, as writing to stolen
    requires a GGTT mapping - but now that we have partial mappable support,
    the mappable aperture isn't quite so precious so we can use it more
    freely and ringbuffers are a good user for the otherwise wasted stolen.

After reverting this patch on drm-intel-nightly, I didn't see the GPU hang
during the guest boot process.
So what's our next step?

Thanks
>
> > Disabling power saving makes many bugs go away, but we still don't
> > disable power saving as a resolution to such bugs, but instead root
> > cause and fix the individual bugs.
> [Zhang, Xiong Y] I added i915.enable_rc6=0, i915.enable_dc=0,
> i915.enable_fbc=0, i915.enable_psr=0, i915.disable_power_well=0 and
> i915.enable_ips=0 to the grub kernel command line.
> But the GPU hang still occurs in the guest and the DMA R/W errors still
> appear in the host.
>
> >
> > > Stolen memory isn't a standard PCI resource. It lives in an RMRR,
> > > which gets an identity mapping in the IOMMU table when the host boots,
> > > so the IGD can access stolen memory in the host OS. However, according
> > > to commit c875d2c1b808 ("iommu/vt-d: Exclude devices using RMRRs from
> > > IOMMU API domains"), RMRRs aren't supported by KVM, so both the EPT and
> > > the guest IOMMU domain table lack a mapping for stolen memory in a KVM
> > > IGD passthrough environment.
> >
> > Commit message text still fails to address that an exclusion was added
> > by commit:
> >
> > commit 18436afdc11a00ac881990b454cfb2eae81d6003
> > Author: David Woodhouse <David.Woodhouse@xxxxxxxxx>
> > Date:   Wed Mar 25 15:05:47 2015 +0000
> >
> >     iommu/vt-d: Allow RMRR on graphics devices too
> >
> >     Commit c875d2c1 ("iommu/vt-d: Exclude devices using RMRRs from
> >     IOMMU API domains") prevents certain options for devices with RMRRs.
> >     This even prevents those devices from getting a 1:1 mapping with
> >     'iommu=pt', because we don't have the code to handle *preserving*
> >     the RMRR regions when moving the device between domains.
> >
> > <SNIP>
> >
> > The quoted part of David's commit message leads me to believe that what
> > is missing is simply some kernel code for juggling the RMRRs when moving
> > a device between domains. Why isn't that considered instead? With that
> > implemented, we would have more transparent pass-through, which should
> > be good.
> [Zhang, Xiong Y] c875d2c1 ("iommu/vt-d: Exclude devices using RMRRs from
> IOMMU API domains") prevents devices associated with RMRRs from being
> assigned to a guest. One of the reasons is that RMRRs are known not to be
> supported in the guest domain IOMMU table. If the driver of such a device
> still accesses the RMRR from the guest, serious errors will happen.
> 18436afdc ("iommu/vt-d: Allow RMRR on graphics devices too") adds an
> exception to the above commit, so the IGD can be assigned to a guest. But
> this doesn't mean a 1:1 mapping for the IGD's RMRR will be supported in the
> guest domain IOMMU table.
> 'iommu=pt' sets up a 1:1 mapping for all PCI devices in the host domain
> IOMMU table.
>
> When a device is assigned to a guest and that guest boots, the guest
> domain IOMMU table replaces the host domain IOMMU table on the hardware.
> Our issue is that the guest domain IOMMU table doesn't have a 1:1 mapping
> for the RMRR.
> In order to set up a 1:1 mapping for the RMRR in the guest domain IOMMU
> table, we would have to modify KVM and QEMU, which the KVM community has
> declined.
> >
> > Also, was fixing the IGD driver loading with zero stolen memory
> > considered instead? All this information should exist in the commit
> > message.
> [Zhang, Xiong Y] The IGD driver and the i915 driver read PCI config
> register 0x50 to get the size of stolen memory. When the guest reads this
> register, QEMU can trap the read and return a value to the guest.
> So, in order to "fix the IGD driver loading with zero stolen memory", we
> would have to modify both QEMU and the IGD driver:
> 1) QEMU: trap reads of PCI config register 0x50 and return zero to the
>    guest.
> 2) IGD driver: when the driver sees a zero-size stolen memory, don't abort
>    loading but continue.
> This doesn't give any benefit to i915: i915 would still disable stolen
> memory once it sees a zero-size stolen memory. So I prefer to disable
> stolen memory in i915 directly and keep QEMU and the IGD driver unchanged.
> >
> > Once the bisecting is properly done and there is an agreement that the
> > suggested RMRR preservation is absolutely a no-go and that other options
> > are not viable, the commit message should be updated to reflect all
> > that. Then we should look in more detail at how to detect the scenarios
> > where we're running in a virtual machine that doesn't set up the 1:1
> > mapping for RMRRs.
> [Zhang, Xiong Y] Sure, I will do this once we have an agreement.
> I really need help from others who can correct me if I am wrong.
> >
> > Regards, Joonas
> > --
> > Joonas Lahtinen
> > Open Source Technology Center
> > Intel Corporation
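
The mapping problem discussed in this thread can be illustrated with a toy
model. The sketch below is not kernel, KVM, or QEMU code. It is a minimal
Python illustration, assuming a hypothetical `IommuDomain` class and made-up
addresses and sizes, of why DMA to stolen memory translates in a host domain
that identity-maps the RMRR but faults in a guest domain that only maps guest
RAM.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096

# Hypothetical values for illustration only; the real stolen memory base and
# size are platform specific.
STOLEN_BASE = 0x7B000000
STOLEN_SIZE = 16 * PAGE_SIZE


class DmaFault(Exception):
    """Raised when a device address has no mapping in the active domain."""


@dataclass
class IommuDomain:
    """Toy IOMMU domain: device page frame -> host page frame."""
    name: str
    pages: dict = field(default_factory=dict)

    def map_range(self, dev_addr, host_addr, size):
        # Map one page at a time, like a flat page table.
        for off in range(0, size, PAGE_SIZE):
            dev_frame = (dev_addr + off) // PAGE_SIZE
            self.pages[dev_frame] = (host_addr + off) // PAGE_SIZE

    def map_identity(self, addr, size):
        # 1:1 mapping, as the host sets up for an RMRR at boot.
        self.map_range(addr, addr, size)

    def translate(self, dev_addr):
        frame = self.pages.get(dev_addr // PAGE_SIZE)
        if frame is None:
            raise DmaFault(f"{self.name}: no mapping for {dev_addr:#x}")
        return frame * PAGE_SIZE + dev_addr % PAGE_SIZE


def dma_succeeds(domain, dev_addr):
    """Return True if the device access translates, False if it faults."""
    try:
        domain.translate(dev_addr)
        return True
    except DmaFault:
        return False


# Host domain: the RMRR covering stolen memory is identity mapped.
host = IommuDomain("host")
host.map_identity(STOLEN_BASE, STOLEN_SIZE)

# Guest domain: only guest RAM is mapped (hypothetical addresses); the RMRR
# mapping is not carried over when the device moves to this domain.
guest = IommuDomain("guest")
guest.map_range(0x0, 0x40000000, 64 * PAGE_SIZE)

print("host  DMA to stolen ok:", dma_succeeds(host, STOLEN_BASE))
print("guest DMA to stolen ok:", dma_succeeds(guest, STOLEN_BASE))
```

In this model the host lookup succeeds and the guest lookup faults, which
mirrors the DMA R/W errors seen in the host dmesg once the guest domain table
is active. Preserving the RMRR when moving the device between domains would
correspond to also calling `guest.map_identity(STOLEN_BASE, STOLEN_SIZE)`.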