Re: Regression 5.14.21 vs 5.15.1: mmc blocked: mmc0: cqhci: timeout for tag #

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 26/01/23 19:25, sebastian.kloska@xxxxxxxx wrote:
> Am 23.01.2023 11:46, schrieb Adrian Hunter:
>> On 21/01/23 15:37, sebastian.kloska@xxxxxxxx wrote:
>>> Am 20.01.2023 13:08, schrieb Adrian Hunter:
>>>> On 20/01/23 13:14, sebastian.kloska@xxxxxxxx wrote:
>>>>> So no mail with attachment on linux-mmc@xxxxxxxxxxxxxxx
>>>>> but I linked the dmesgs on https://bugzilla.kernel.org/show_bug.cgi?id=216255 <https://bugzilla.kernel.org/show_bug.cgi?id=216255>
>>>>
>>>> I did receive the logs but it is good to attach them to the bug anyway.
>>>>
>>>>>  
>>>>> Am 19.01.2023 11:28, schrieb Adrian Hunter:
>>>>>> On 19/01/23 11:46, sebastian.kloska@xxxxxxxx <mailto:sebastian.kloska@xxxxxxxx> wrote:
>>>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter:
>>>>>>>> On 18/01/23 23:24, sebastian.kloska@xxxxxxxx <mailto:sebastian.kloska@xxxxxxxx> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1
>>>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4
>>>>>>>>> eMMC: Kingston A29128
>>>>>>>>>
>>>>>>>>> * A couple of seconds after boot access to the eMMC is completely
>>>>>>>>>   blocked. Can be triggered by extensive I/O (e.g, with badblocks -w)
>>>>>>>>>
>>>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1
>>>>>>>>>   Latest highest version including the bug is a fedora 6.1.6
>>>>>>>>
>>>>>>>> There is very little difference in drivers/mmc between 5.14.21
>>>>>>>> and 5.15.1. Are you sure the same issue is not present in
>>>>>>>> 5.14.21?
>>>>>>>
>>>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after ~10min
>>>>>>> The "slowed down by pr_debug" version sometimes lives for >2h (without stress
>>>>>>> via badblocks etc)
>>>>>>>
>>>>> So all of the following is now done on the latest stable, Which currently
>>>>> is 6.2.0-rc4+
>>>>
>>>> That is a release candidate (rc).  The latest stable kernel
>>>> is 6.1.7.
>>>>
>>>
>>> I know -just thought I should pull master/HEAD from the latest stable
>>>
>>>>>>>>
>>>>>>>> You could try disabling runtime PM to see if that helps:
>>>>>>>>
>>>>>>>>     echo on > /sys/class/mmc_host/mmc0/device/power/control
>>>>>>>>
>>>>> Didn't help
>>>>>
>>>>>>> Will try
>>>>>>>> Also it should be possible to disable cqhci by adding to the kernel
>>>>>>>> command line:
>>>>>>>>
>>>>>>>>     sdhci.debug_quirks=0x40a0000
>>>>>>>>
>>>>>
>>>>> Weird behavior: Seems to make it worse. System often freezes
>>>>> during early boot-up.
>>>>>
>>>>> HOWEVER: If I slow down the driver via dynamic debugging it seems to be stable
>>>>> Failed to crash it over a couple of hours. You can say it fixes the issue if you
>>>>> are willing to accept that 80% of one CPU core is busy with logging ;-)
>>>>
>>>> That is interesting.  Logging can have more side effects than you might expect.
>>>> For example the logs might be written back to storage by syslog, which causes
>>>> more logging, and more writing etc etc.  That in turn is bumping up the CPU
>>>> load.
>>>>
>>>> Increasing the CPU load will decrease the chance of the CPU entering a low-power
>>>> state (C-state).
>>>>
>>>> And there was an issue with C-state transitions affecting Intel SDHCI
>>>> controllers,
>>>> so we need to check that possibility.  You could try kernel command line option:
>>>>
>>>>     intel_idle.max_cstate=1
>>>>
>>>
>>> This -- Sir -- works just wonderfully !!
>>>
>>> Don't see any changes in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but
>>> may be that's expected. Kernel runs just fine. works for stable/6.1.7 and fedora/6.1.6
>>> So for me personally this workaround is fine. Guess it will hurt my
>>> power consumption/battery live though but that's acceptable for me.
>>>
>>> Don't hesitate to contact if I should test patches for you. The source
>>> is checked out and ccache is filled.
>>
>> It is probably possible to make a kernel patch to prevent
>> low-power C-States only when eMMC is in use.  I will check
>> on that.
>>
>> In the meantime, could you send an acpidump e.g.
>>
>>     sudo apt-get install acpica-tools
>>     sudo acpidump -o /tmp/acpidump.txt
>>
>> Also do you know if you have an up to date BIOS?
> 
> So this is pretty embarrassing. Finally found the means to update my
> BIOS and turned out to fix the problem.
> Should have been the first on my list to solve the issue.
> THX for the effort

That's cool, no worries.  I would also be interested to get
an acpidump from the updated machine.

>>
>>>
>>>> You can see the C-states and whether they are disabled:
>>>>
>>>>     grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>>>>
>>>> "disable" is writable by root, so you can disable them that way also.
>>>>
>>>> In the meantime I will take a look at the logs.
>>>>
>>>>>
>>>>>>> Will try
>>>>>>>>>
>>>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side effect the driver
>>>>>>>>>   seems to be much more stable, so to me it appears like a timing issue where
>>>>>>>>>   the driver fails to wait for a specific state change which seems to appear
>>>>>>>>>   less often when  slowed down by logging
>>>>>>>>
>>>>>>>> Ideally it would be good to see a full log with dynamic
>>>>>>>> debugging enabled for the latest kernel.
>>>>>>>
>>>>>>> Latest torvalds/linux or stable/linux ?
>>>>>>
>>>>>> Stable would be fine.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>     To enable mmc debug via kernel command line:
>>>>>>>>
>>>>>>>>         dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
>>>>>>>>
>>>>>
>>>>> This for whatever reason did not work for me on the kernel cmdline
>>>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control
>>>>> so the attached dmesgs are missing early logs.
>>>>>
>>>>>>>>     Kernel must be configured:
>>>>>>>>
>>>>>>>>         CONFIG_DYNAMIC_DEBUG=y
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Would love to have a decent reference documentation to the SD Controller
>>>>>>>>>   especially a detailed description of the behavior of the register map in
>>>>>>>>>  mmc/host/cqhci.h All intel documents I could find seem to describe
>>>>>>>>>  similar but not quite identical register maps
>>>>>>>>
>>>>>>>> The JEDEC eMMC specification describes the CQHCI registers.
>>>>>>>> It looks like you need to register to get access to them
>>>>>>>> at www.jedec.org.  <http://www.jedec.org. ;> AFAICT older specs, which would be just
>>>>>>>> fine, are free to download after registering.
>>>>>>>>
>>>>>
>>>>> Uhh -- but the price for registration is hefty. THX anyway
>>>>>
>>>>>>> THX
>>>>>>>> You can get SD Host Controller Simplified Specification
>>>>>>>> from www.sdcard.org <http://www.sdcard.org>.
>>>>>>>>
>>>>>
>>>>> I attach to dmesg logs
>>>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000
>>>>> * The other without dynamic debugging (as said I failed to crash the system) and sdhci.debug_quirks=0x40a0000
>>>>>
>>>>> Hope I filtered properly for the relevant stuff
>>>>> The second one contains what seems to me a ton of kernel Oops but
>>>>> that might be due to the preliminary status of 6.2.0-rc4
>>>>>
>>>>> THX
>>>>>
>>>>>>> THX
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> * dmesg
>>>>>>>>>
>>>>>>>>> [  347.583082] mmc0: cqhci: timeout for tag 3, qcnt
>>>>>>>>
>>>>>>>> The timeout is pretty long, so it seems like the controller
>>>>>>>> has gotten stuck.  It won't reset the command or data
>>>>>>>> circuits either, which is impossible if it is behaving
>>>>>>>> correctly.
>>>>>>>>
>>>>>>>>> [  347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
>>>>>>>>> [  347.583114] mmc0: cqhci: Caps:      0x000030c0 | Version:  0x00000510
>>>>>>>>> [  347.583117] mmc0: cqhci: Config:    0x00001101 | Control:  0x00000000
>>>>>>>>> [  347.583120] mmc0: cqhci: Int stat:  0x00000000 | Int enab: 0x00000016
>>>>>>>>> [  347.583123] mmc0: cqhci: Int sig:   0x00000016 | Int Coal: 0x00000000
>>>>>>>>> [  347.583157] mmc0: cqhci: TDL base:  0x7c2b5000 | TDL up32: 0x00000001
>>>>>>>>> [  347.583160] mmc0: cqhci: Doorbell:  0x0000000f | TCN:      0x00000000
>>>>>>>>> [  347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008
>>>>>>>>> [  347.583166] mmc0: cqhci: Task clr:  0x00000000 | SSC1:     0x00010008
>>>>>>>>> [  347.583169] mmc0: cqhci: SSC2:      0x00000001 | DCMD rsp: 0x00000800
>>>>>>>>> [  347.583172] mmc0: cqhci: RED mask:  0xfdf9a080 | TERRI:    0x00000000
>>>>>>>>> [  347.583175] mmc0: cqhci: Resp idx:  0x0000002f | Resp arg: 0x00000900
>>>>>>>>> [  347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.583182] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.583185] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.583188] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.583191] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.583194] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.583197] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.583200] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.583203] mmc0: sdhci: Int enab:  0x02ff4000 | Sig enab: 0x02ff4000
>>>>>>>>> [  347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.583209] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.583212] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.583214] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.583217] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.583219] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.583223] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.583224] mmc0: sdhci: ============================================
>>>>>>>>> [  347.583229] mmc0: running CQE recovery
>>>>>>>>> [  347.690108] mmc0: Reset 0x2 never completed.
>>>>>>>>> [  347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.690115] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.690118] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.690121] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.690124] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.690127] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.690130] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.690133] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.690136] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>>>> [  347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.690141] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.690144] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.690147] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.690150] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.690152] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.690156] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.690157] mmc0: sdhci: ============================================
>>>>>>>>> [  347.790190] mmc0: Reset 0x4 never completed.
>>>>>>>>> [  347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.790193] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.790196] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.790199] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.790202] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.790205] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.790208] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.790211] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.790213] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>>>> [  347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.790219] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.790222] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.790225] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.790228] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.790230] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.790234] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.790234] mmc0: sdhci: ============================================
>>>>>>>>> [  347.803056] mmc0: Controller never released inhibit bit(s).
>>>>>>>>> [  347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>>>>>>>> [  347.803077] mmc0: sdhci: Sys addr:  0x00000080 | Version:  0x00001002
>>>>>>>>> [  347.803080] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000070
>>>>>>>>> [  347.803083] mmc0: sdhci: Argument:  0x00030000 | Trn mode: 0x00000023
>>>>>>>>> [  347.803086] mmc0: sdhci: Present:   0x1fff0106 | Host ctl: 0x0000003c
>>>>>>>>> [  347.803089] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>>>>>>>> [  347.803092] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
>>>>>>>>> [  347.803095] mmc0: sdhci: Timeout:   0x0000000e | Int stat: 0x00000000
>>>>>>>>> [  347.803098] mmc0: sdhci: Int enab:  0x00ff0003 | Sig enab: 0x00ff0003
>>>>>>>>> [  347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>>>>>>>> [  347.803103] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x80000807
>>>>>>>>> [  347.803106] mmc0: sdhci: Cmd:       0x00002f3a | Max curr: 0x00000000
>>>>>>>>> [  347.803109] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
>>>>>>>>> [  347.803112] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>>>>>>>> [  347.803114] mmc0: sdhci: Host ctl2: 0x0000000d
>>>>>>>>> [  347.803118] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000011eef1218
>>>>>>>>> [  347.803119] mmc0: sdhci: ============================================
>>>>>>>>>  --
>>>>>>>>>
>>>>> -- 
>>>>>
>>>
> 




[Index of Archives]     [Linux Memonry Technology]     [Linux USB Devel]     [Linux Media]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux