Re: [PATCH 2/2] xHCI: fix bug of resume for VIA xHCI host

Elric Fu <elricfu1@xxxxxxxxx> · Sat, 17 Mar 2012 19:01:47 +0800

2012/3/17 Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx>:
> On Thu, Mar 15, 2012 at 05:19:15PM +0800, Elric Fu wrote:
>> 2012/3/15 Xu, Andiry <Andiry.Xu@xxxxxxx>:
>> >> 2012/3/15 Xu, Andiry <Andiry.Xu@xxxxxxx>:
>> >> >> The Save/Restore state funtion of VIA xHCI host is unstable. So add
>> >> >> the XHCI_RESET_ON_RESUME quirk for it.
>> >> >>
>> >> >
>> >> > Can you provide more background of the issue? What's the unstable
>> >> > behavior of VIA host after resume?
>> >>
>> >> The issue of using the Save/Restore state operation will cause the
>> >> device attached to the port of host is disconnected.
>> >>
>>
>> Sorry. My description about the issue is incorrect. If using the save/restore
>> state function, after resume, the superspeed devices work fine but the usb
>> 2.0 devices will be disconnected. It seems like the embedded 2.0 hub in
>> VIA host controller can't be re-enumerated, so the 2.0 devices will be
>> disconnected. I don't know why.
>>
>> >
>> > Hmm. I think xHCI suspend path may need to be save DCBAA and all related
>> > data structures before remove core well power. It's specified by xHCI
>> > 1.0 errata.
>>
>> Well. Saving the DCBAA, Context and other data structures maybe fix this
>> kind of issue.
>
> But the xHCI driver does save the DCBAA.  It's in the xhci_hcd
> structure.  We restore the pointer to the dcbaa in
> xhci_restore_registers(), which is called after a resume from suspend:
>
> static void xhci_restore_registers(struct xhci_hcd *xhci)
> {
>        xhci_writel(xhci, xhci->s3.command, &xhci->op_regs->command);
>        xhci_writel(xhci, xhci->s3.dev_nt, &xhci->op_regs->dev_notification);
>        xhci_write_64(xhci, xhci->s3.dcbaa_ptr, &xhci->op_regs->dcbaa_ptr);
> ...

Sorry, I forgot that. But about this I have a question. The errata
file specify the system software need to save a memory image of
DCBAA, Contexts and other data structures. Why? May DCBAA
and those data structures be corrupted? or other reasons? Driver
have already kept the DCBAAP.

I tested the save/restore state operation in VIA xHCI host. The
superspeed device attached to the usb3 root hub works fine and
the integrated 2.0 hub can't be re-enumerated. The address
device command sent to integrated 2.0 hub was failed. That's a
little stranger. The usb analyzer can't capture the trace between
the root hub and integrated usb2.0 hub in VIA host. I need to
discuss this with the FW team.

>
> We don't restore the register state after a hibernate, and I've been
> wondering why.  Andiry, I think you (or one of the AMD team) wrote the
> suspend/resume code.  Do you remember why we unconditionally do not
> restore the registers on resume from hibernate?
>
> int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
> {
> ...
>        if (!hibernated) {
>                /* restore register code */
>        }
>        /* If restore operation fails, re-initialize the HC during resume */
>        if ((temp & STS_SRE) || hibernated) {
>        }
>
> I know that power is completely removed during hibernate, so maybe we
> just can't restore the registers?

Heard from FW developer, if we reset and re-initialize the host the save
and restore state function will be invalid. I think those states saved will
be lost if we reset the host, and we have to reset the host as the power
is removed.

>
> Now, I did look at the xHCI 1.0 spec errata from last June, and I think
> we're saving and restoring the register state in the wrong order.
> Patches to follow shortly.
>
>> >> >> When driver free Event Ring in the resume procedure, driver can't
>> >> >> write zero to Event Ring Registers in VIA xHCI host, otherwise the
>> >> >> VIA host can't be resumed.
>> >> >>
>> >> >
>> >> > Strange. What do you mean by "can't resume"? Does xhci_resume() return
>> >> > failure, or it returns success but host does not work after resume?
>> >>
>> >> The issue is the host will hang up. The reason is that the host has a
>> >> hardware state machine, in the reset-on-resume procedure driver will
>> >> reset the host first then free event ring and re-initialize the host. But
>> >> if driver set 0 to event ring registers in xhci_mem_cleanup() after reset
>> >> the host, the host will think the "0" is a vaild value, and fetch it. it
>> >> result
>> >> in the host hang up.
>> >>
>> >
>> > xhci_halt() is called before xhci_reset(). Why does the host fetch event ring
>> > registers after it's halted?
>>
>> I don't know. As far as I know, after reset the VIA host,  the host
>> will fetch the
>> value of event ring register as long as the driver write those registers. Then
>> after driver write 0 to those registers the host thinks the value is
>> valid, maybe
>> do something to event ring after fetching the value. I don't know the more
>> details about it.
>>
>> >> > If the host needs a reset-on-resume, it will perform a complete
>> >> > re-initialization. After xhci_mem_cleanup() called, xhci_resume() will
>> >> > call xhci_init(), in which the ERST registers are re-written. Have you
>> >> > checked the value, see if it's different form before suspend?
>> >>
>> >> I don't think in reset-on-resume write 0 to event Ring registers is
>> >> necessary.
>> >> As driver already reset the host before xhci_mem_cleanup(), the value of
>> >> event ring registers will be 0.
>
> As Eric pointed out, the solution seems to just be to not write zeroes
> to the DCBAA on memory clean up.  After halt and reset, the xHCI host
> DCBAA entry will be reset to its default (zero), so there's no reason to
> write it again.  If the host *isn't* halted because the stop command
> write timed out, we really don't want to write anything to the DCBAA.
>

About this I talked with the FW developers. They said the chip had a
hardware state machine and after the driver reset the host the state
machine transitioned into a state (I don't know what state) and host will
fetch the event ring registers as long as driver write those registers.
So driver call the xhci_mem_cleanup() after xhci_reset(), the host will
think 0 wrote by xhci_mem_cleanup() is valid. That result in the host
is in trouble.

But your opinion may be right. I need to test and verify it.

>> >> > Also, note xhci_mem_cleanup is called not only by xhci_resume(), but also
>> >> > by xhci_stop(). This means you may find the same issue by rmmod/modprobe
>> >> > xhci-hcd.
>> >>
>> >> rmmod/modprobe xhci-hcd is ok. I have already tested it. I think it is
>> >> ok if we write
>> >> a valid value except for 0 to event ring registers in VIA xHCI host
>> >> after reset the
>> >> host.
>
> That's very strange.  The shutdown code in the failed restore case is
> nearly identical to the code in xhci_stop() that's called in the
> shutdown case.  The only difference is that xhci_halt() is called twice
> in the shutdown case, so maybe the VIA host is not actually halted on
> the failed register restore path?  Maybe we need to increase
> XHCI_MAX_HALT_USEC?
>
> Actually, the shutdown and hibernate/failed register restore paths
> functions desperately need to be refactored to share code.  It's
> basically (but not quite) copy-paste between the two functions.  For
> instance, there's a call to usb_amd_dev_put() in xhci_stop(), but not in
> the failed register restore/hibernate case.
>
> Andiry, is that the PLL shutdown behavior you wanted?
>
> Eric, I'm going to send you a couple of patches, and I want you to apply
> them one at a time and try suspending and resuming (hopefully with
> netconsole or something so you can capture log messages on resume).
>
> The first patch adds a warning to xhci_halt in case the host doesn't
> actually halt.  We should see if your failed resume is due to the host
> not actually being halted when DCBAA is written.
>
> The second removes the writes of zero to the DCBAA and command ring
> pointers in xhci_mem_cleanup(), which should remove the hang all
> together.
>
> The third increases the xHCI halt timeout, and should only be applied if
> you find the host isn't halting in time.
>
> The fourth changes the order of the register saves and restores to match
> the xHCI 1.0 errata.  That may make your host indicate the register
> restore actually worked, which will mean the xHCI driver doesn't need to
> re-initialize the host at all.

OK. I will apply and test those patches. Thanks.

Best Regards,
Elric Fu

>
> Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html