Re: [PATCH V6] drm/i915: Disable stolen memory when i915 runs in guest vm

"Zhang, Xiong Y" <xiong.y.zhang@xxxxxxxxx> · Wed, 3 May 2017 09:22:22 +0000

> > + David and Jon
> >
> > On Tue, 2017-04-25 at 18:34 +0800, Xiong Zhang wrote:
> >
> > The blocking issue I see is that bisecting is still not pointing at
> > relevant commits. Both bisected commits from Bugzilla are not related
> > to changes in stolen memory usage behavior. I'd assume a successful
> > bisect to land at the patches where we start creating kernel internal
> > objects from stolen memory. Otherwise we could be ignoring a bug
> > elsewhere. If it consistently lands on those patches, then there might
> > be something wrong with them, in addition to stolen memory problems.
> [Zhang, Xiong Y] I have only tried kernel 4.8, 4.9 and above. As the Bugzilla
> entry describes, a 4.8 guest kernel shows no GPU hang in the guest dmesg, while
> a 4.9 guest kernel does. From that point of view, we could do a git bisect.
> However, the host dmesg shows tons of IOMMU DMA R/W exceptions on stolen memory
> with both the 4.8 and the 4.9 guest kernel. This means the guest domain IOMMU
> table has no mapping for stolen memory, and the IGD fails to access stolen
> memory under both guest kernels. From that point of view, this issue isn't a
> regression and isn't a candidate for git bisect. You can check the host error
> messages in the Bugzilla attachment. This should be fixed first.
> Anyway, I will try my best to find the relevant commit through git bisect, but
> I'm afraid the result will be the same as before, because we don't have a
> stable good point to start the bisect from.
[Zhang, Xiong Y] Hi Joonas,
As you said, the GPU hang occurs because i915 creates the ring buffer from stolen memory.
I ran git bisect again, and the following commit is the first bad commit:
commit c58b735fc762e891481e92af7124b85cb0a51fce
Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Date:   Thu Aug 18 17:16:57 2016 +0100

    drm/i915: Allocate rings from stolen

    If we have stolen available, make use of it for ringbuffer allocation.
    Previously this was restricted to !llc platforms, as writing to stolen
    requires a GGTT mapping - but now that we have partial mappable support,
    the mappable aperture isn't quite so precious so we can use it more
    freely and ringbuffers are a good user for the otherwise wasted stolen.
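
For readers following the thread, the allocation policy that this commit message describes, combined with the guest-side opt-out the patch under review proposes, can be sketched roughly as below. This is a simplified, self-contained illustration and not the actual i915 code: `struct gpu_state`, `ring_pick_backing()` and the `stolen_disabled` flag are hypothetical names introduced only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backing-store kinds; the names are illustrative only. */
enum ring_backing {
    RING_BACKING_STOLEN,   /* carved out of firmware-reserved stolen memory */
    RING_BACKING_INTERNAL, /* ordinary system pages */
};

/* Hypothetical, simplified view of the device state relevant here. */
struct gpu_state {
    size_t stolen_free;   /* bytes of stolen memory still unallocated */
    bool stolen_disabled; /* assumption: set when running as a passthrough guest */
};

/*
 * Pick a backing store for a ring buffer of `size` bytes.
 * Stolen memory is preferred when it is usable and large enough;
 * otherwise the allocation falls back to system pages.
 */
static enum ring_backing ring_pick_backing(struct gpu_state *gpu, size_t size)
{
    assert(size > 0);

    if (!gpu->stolen_disabled && gpu->stolen_free >= size) {
        gpu->stolen_free -= size;
        return RING_BACKING_STOLEN;
    }
    return RING_BACKING_INTERNAL;
}
```

With `stolen_disabled` set, every ring falls back to system pages, which has the same effect on ring allocation as reverting the commit, but only for the guest case.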

After reverting this patch on drm-intel-nightly, I no longer see the GPU hang during guest boot.
So what's our next step?

thanks
> 
> > Disabling power saving makes many bugs go away, but we still don't
> > disable power saving as a resolution to such bugs, but instead root
> > cause and fix the individual bugs.
> [Zhang, Xiong Y] I added i915.enable_rc6=0, i915.enable_dc=0,
> i915.enable_fbc=0, i915.enable_psr=0, i915.disable_power_well=0 and
> i915.enable_ips=0 to the kernel command line in grub.
> The GPU hang still occurs in the guest and the DMA R/W errors still appear in
> the host.
> >
> > > Stolen memory isn't a standard PCI resource; it lives in an RMRR, which has
> > > an identity mapping in the IOMMU table when the host boots, so the IGD can
> > > access stolen memory in the host OS. However, according to commit
> > > c875d2c1b808 ("iommu/vt-d: Exclude devices using RMRRs from IOMMU API
> > > domains"), RMRRs aren't supported by kvm, so both the EPT and the guest
> > > IOMMU domain table lack a mapping for stolen memory in a kvm IGD
> > > passthrough environment.
> >
> > Commit message text still fails to address that an exclusion was added
> > by commit:
> >
> > commit 18436afdc11a00ac881990b454cfb2eae81d6003
> > Author: David Woodhouse <David.Woodhouse@xxxxxxxxx>
> > Date:   Wed Mar 25 15:05:47 2015 +0000
> >
> >     iommu/vt-d: Allow RMRR on graphics devices too
> >
> >     Commit c875d2c1 ("iommu/vt-d: Exclude devices using RMRRs from IOMMU API
> >     domains") prevents certain options for devices with RMRRs. This even
> >     prevents those devices from getting a 1:1 mapping with 'iommu=pt',
> >     because we don't have the code to handle *preserving* the RMRR regions
> >     when moving the device between domains.
> >
> > <SNIP>
> >
> > The quoted part of David's commit message leads me to believe that what is
> > missing is simply some kernel code for juggling the RMRRs when moving a
> > device between domains. Why isn't that considered instead? With that
> > implemented, we would have more transparent pass-through, which should
> > be good.
> [Zhang, Xiong Y] c875d2c1 ("iommu/vt-d: Exclude devices using RMRRs from
> IOMMU API domains") prevents devices associated with RMRRs from being
> assigned to a guest. One of the reasons is that RMRRs aren't supported in the
> guest domain IOMMU table; if such a device's driver still accesses an RMRR
> from the guest, serious errors will happen.
> 18436afdc ("iommu/vt-d: Allow RMRR on graphics devices too") adds an
> exception to the above commit, so the IGD can be assigned to a guest. But this
> doesn't mean that a 1:1 mapping for the IGD's RMRR will be supported in the
> guest domain IOMMU table.
> 'iommu=pt' sets up a 1:1 mapping for all PCI devices in the host domain IOMMU
> table.
> 
> When a device is assigned to a guest and the guest boots, the guest domain
> IOMMU table takes the place of the host domain IOMMU table in hardware. Our
> issue is that the guest domain IOMMU table doesn't have a 1:1 mapping for the
> RMRR.
> To set up a 1:1 mapping for the RMRR in the guest domain IOMMU table, we would
> have to modify kvm and qemu, and the kvm community has declined this.
> >
> > Also, was fixing the IGD driver loading with zero stolen memory
> > considered instead? All this information should exist in the commit
> > message.
> [Zhang, Xiong Y] The IGD driver and i915 read PCI config register 0x50 to get
> the size of stolen memory. When the guest reads this register, qemu can trap
> the read and return a value of its choosing to the guest.
> So in order to "fix the IGD driver loading with zero stolen memory", we would
> have to modify both qemu and the IGD driver:
> 1) QEMU: trap reads of PCI config register 0x50 and return zero to the guest.
> 2) IGD driver: when the driver sees zero-size stolen memory, continue loading
> instead of aborting.
> This gives no benefit to i915, because i915 would still disable stolen memory
> when it sees zero-size stolen memory. So I prefer to disable stolen memory in
> i915 directly and keep qemu and the IGD driver unchanged.
> >
> > Once the bisecting is properly done and there is agreement that the
> > suggested RMRR preservation is absolutely a no-go and that the other
> > options are not viable, the commit message should be updated to reflect
> > all that. Then we should look in more detail at how to detect the
> > scenarios where we're running in a virtual machine that doesn't set up
> > the 1:1 mapping for RMRRs.
> [Zhang, Xiong Y] Sure, I will do this once we have an agreement.
> I would really appreciate help from others who can correct me if I am wrong.
> >
> > Regards, Joonas
> > --
> > Joonas Lahtinen
> > Open Source Technology Center
> > Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx