Re: [PATCH] drm/radeon: fix asic initialization for virtualized environments

Alex Deucher <alexdeucher@xxxxxxxxx> · Wed, 15 Jun 2016 13:00:10 -0400

On Wed, Jun 15, 2016 at 12:45 PM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
> On Wed, 15 Jun 2016 02:23:37 -0400
> Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>
>> On Mon, Jun 13, 2016 at 4:10 PM, Alex Williamson
>> <alex.williamson@xxxxxxxxxx> wrote:
>> > On Mon, 13 Jun 2016 15:45:20 -0400
>> > Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>> >
>> >> When executing in a PCI passthrough based virtualization environment, the
>> >> hypervisor will usually attempt to send a PCIe bus reset signal to the
>> >> ASIC when the VM reboots. In this scenario, the card is not correctly
>> >> initialized, but we still consider it to be posted. Therefore, in a
>> >> passthrough based environment we should always post the card to guarantee
>> >> it is in a good state for driver initialization.
>> >>
>> >> Ported from amdgpu commit:
>> >> amdgpu: fix asic initialization for virtualized environments
>> >>
>> >> Cc: Andres Rodriguez <andres.rodriguez@xxxxxxx>
>> >> Cc: Alex Williamson <alex.williamson@xxxxxxxxxx>
>> >> Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
>> >> Cc: stable@xxxxxxxxxxxxxxx
>> >> ---
>> >>  drivers/gpu/drm/radeon/radeon_device.c | 21 +++++++++++++++++++++
>> >>  1 file changed, 21 insertions(+)
>> >
>> > Thanks, I expect it's an improvement, though it's always a bit
>> > disappointing when a driver starts modifying its behavior based on
>> > what might be a transient feature of the platform, in this case a
>> > hypervisor platform.  For instance, why does our bus reset and video
>> > ROM execution result in a different state than a physical BIOS doing
>> > the same?  Can't this condition occur regardless of a hypervisor,
>>
>> Just doing a pci reset is not enough on newer cards.  The hw handling
>> pci resets changed in CI and more of the logic moved to the driver.
>
> Gag, please relay my disapproval to your hardware folks.
>
>> That does a limited reset, but not the registers that the driver
>> checks to determine whether or not the asic has been posted so the
>> driver skips posting and leaves the hw in a bad reset state.
>>
>> > perhaps a rare hot-add of a GPU, a bare metal kexec reboot, or perhaps
>> > simply a system BIOS optimized to post a limited set of devices.
>>
>> We can tell if a card has never been posted and properly post it.
>> Where it's tricky is when a card has been posted and has subsequently
>> been pci reset on CI and newer hw.  I'm not sure of a good way to
>> detect this particular scenario.  Generally this is mainly done for
>> qemu/kvm.
>
> How do you tell if a card has never been posted?  Is it something we
> could easily toggle after a bus reset?

We check CONFIG_MEMSIZE which is a scratch register set by the
asic_init command table to tell the driver how much vram is on the
board.
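
As a rough sketch of that check combined with the override this patch
proposes (this is not the actual driver code; the struct, its fields, and
the function name are made up for illustration, and the hypervisor flag is
assumed to come from some platform check):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for device state; not the real radeon_device. */
struct fake_gpu {
	uint32_t config_memsize; /* scratch register, assumed 0 until asic_init runs */
	bool     in_vm;          /* assumed result of a hypervisor check */
};

/* Returns true when the driver may skip posting the card. */
static bool fake_card_posted(const struct fake_gpu *gpu)
{
	/* In a passthrough guest the scratch state may be stale: always post. */
	if (gpu->in_vm)
		return false;

	/* asic_init writes the vram size here, so non-zero is taken as posted. */
	return gpu->config_memsize != 0;
}
```

The point of the early return is that a bus reset on CI and newer parts
does not clear the scratch register, so the second test alone would report
a half-reset card as posted.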

>
>> > Detection based on some state of the device rather than an expectation
>> > based on what the device is running on seems preferable.  I suspect
>> > Andres' patch for amdgpu only affects newer devices, which pretty much
>> > all suffer reset issues, at least under QEMU/VFIO, but I wonder how this
>> > patch affects existing working devices, like 6, 7, and some 8-series.
>>
>> Posting the asic at init time should be safe on all asics.
>>
>> > Anyway, if this is the solution to the poor behavior we've seen with
>> > assigned AMD cards, maybe someone could request the same for the closed
>> > drivers, including Windows.  Thanks,
>>
>> The closed drivers already do this.
>
> Hmm, that's not terribly encouraging then since the majority of users
> are running Windows guests for the purpose of creating a gaming VM and
> still experiencing reset issues with the closed drivers there.  Thanks,

I'll have to check with the Windows team to see how much validation
they do with the Windows driver as a qemu/kvm guest.  It could be that
they don't properly detect that as a virtual case.

Alex