On 26/01/23 19:25, sebastian.kloska@xxxxxxxx wrote: > Am 23.01.2023 11:46, schrieb Adrian Hunter: >> On 21/01/23 15:37, sebastian.kloska@xxxxxxxx wrote: >>> Am 20.01.2023 13:08, schrieb Adrian Hunter: >>>> On 20/01/23 13:14, sebastian.kloska@xxxxxxxx wrote: >>>>> So no mail with attachment on linux-mmc@xxxxxxxxxxxxxxx >>>>> but I linked the dmesgs on https://bugzilla.kernel.org/show_bug.cgi?id=216255 <https://bugzilla.kernel.org/show_bug.cgi?id=216255> >>>> >>>> I did receive the logs but it is good to attach them to the bug anyway. >>>> >>>>> >>>>> Am 19.01.2023 11:28, schrieb Adrian Hunter: >>>>>> On 19/01/23 11:46, sebastian.kloska@xxxxxxxx <mailto:sebastian.kloska@xxxxxxxx> wrote: >>>>>>> Am 19.01.2023 10:15, schrieb Adrian Hunter: >>>>>>>> On 18/01/23 23:24, sebastian.kloska@xxxxxxxx <mailto:sebastian.kloska@xxxxxxxx> wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Hardware: Acer Swift-1 SF114-34-P6U1 >>>>>>>>> SD Host controller: Intel Corporation Device 4dc4 (rev 01) 8086:4dc4 >>>>>>>>> eMMC: Kingston A29128 >>>>>>>>> >>>>>>>>> * A couple of seconds after boot access to the eMMC is completely >>>>>>>>> blocked. Can be triggered by extensive I/O (e.g, with badblocks -w) >>>>>>>>> >>>>>>>>> * Tested with stable branch linux-5.14.21 vs. linux-5.15.1 >>>>>>>>> Latest highest version including the bug is a fedora 6.1.6 >>>>>>>> >>>>>>>> There is very little difference in drivers/mmc between 5.14.21 >>>>>>>> and 5.15.1. Are you sure the same issue is not present in >>>>>>>> 5.14.21? >>>>>>> >>>>>>> Pretty sure. Using 5.14.21 day in day out 5.15.1 fails on me after ~10min >>>>>>> The "slowed down by pr_debug" version sometimes lives for >2h (without stress >>>>>>> via badblocks etc) >>>>>>> >>>>> So all of the following is now done on the latest stable, Which currently >>>>> is 6.2.0-rc4+ >>>> >>>> That is a release candidate (rc). The latest stable kernel >>>> is 6.1.7. >>>> >>> >>> I know -just thought I should pull master/HEAD from the latest stable >>> >>>>>>>> >>>>>>>> You could try disabling runtime PM to see if that helps: >>>>>>>> >>>>>>>> echo on > /sys/class/mmc_host/mmc0/device/power/control >>>>>>>> >>>>> Didn't help >>>>> >>>>>>> Will try >>>>>>>> Also it should be possible to disable cqhci by adding to the kernel >>>>>>>> command line: >>>>>>>> >>>>>>>> sdhci.debug_quirks=0x40a0000 >>>>>>>> >>>>> >>>>> Weird behavior: Seems to make it worse. System often freezes >>>>> during early boot-up. >>>>> >>>>> HOWEVER: If I slow down the driver via dynamic debugging it seems to be stable >>>>> Failed to crash it over a couple of hours. You can say it fixes the issue if you >>>>> are willing to accept that 80% of one CPU core is busy with logging ;-) >>>> >>>> That is interesting. Logging can have more side effects than you might expect. >>>> For example the logs might be written back to storage by syslog, which causes >>>> more logging, and more writing etc etc. That in turn is bumping up the CPU >>>> load. >>>> >>>> Increasing the CPU load will decrease the chance of the CPU entering a low-power >>>> state (C-state). >>>> >>>> And there was an issue with C-state transitions affecting Intel SDHCI >>>> controllers, >>>> so we need to check that possibility. You could try kernel command line option: >>>> >>>> intel_idle.max_cstate=1 >>>> >>> >>> This -- Sir -- works just wonderfully !! >>> >>> Don't see any changes in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable,but >>> may be that's expected. Kernel runs just fine. works for stable/6.1.7 and fedora/6.1.6 >>> So for me personally this workaround is fine. Guess it will hurt my >>> power consumption/battery live though but that's acceptable for me. >>> >>> Don't hesitate to contact if I should test patches for you. The source >>> is checked out and ccache is filled. >> >> It is probably possible to make a kernel patch to prevent >> low-power C-States only when eMMC is in use. I will check >> on that. >> >> In the meantime, could you send an acpidump e.g. >> >> sudo apt-get install acpica-tools >> sudo acpidump -o /tmp/acpidump.txt >> >> Also do you know if you have an up to date BIOS? > > So this is pretty embarrassing. Finally found the means to update my > BIOS and turned out to fix the problem. > Should have been the first on my list to solve the issue. > THX for the effort That's cool, no worries. I would also be interested to get an acpidump from the updated machine. >> >>> >>>> You can see the C-states and whether they are disabled: >>>> >>>> grep -H . /sys/devices/system/cpu/cpu*/cpuidle/state*/disable >>>> >>>> "disable" is writable by root, so you can disable them that way also. >>>> >>>> In the meantime I will take a look at the logs. >>>> >>>>> >>>>>>> Will try >>>>>>>>> >>>>>>>>> * Spiked the code with a lot of pr_debug messages. As a side effect the driver >>>>>>>>> seems to be much more stable, so to me it appears like a timing issue where >>>>>>>>> the driver fails to wait for a specific state change which seems to appear >>>>>>>>> less often when slowed down by logging >>>>>>>> >>>>>>>> Ideally it would be good to see a full log with dynamic >>>>>>>> debugging enabled for the latest kernel. >>>>>>> >>>>>>> Latest torvalds/linux or stable/linux ? >>>>>> >>>>>> Stable would be fine. >>>>>> >>>>>>> >>>>>>>> >>>>>>>> To enable mmc debug via kernel command line: >>>>>>>> >>>>>>>> dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p" >>>>>>>> >>>>> >>>>> This for whatever reason did not work for me on the kernel cmdline >>>>> but I could trigger it via echo "..." ><debugfs>/dynamic_debug/control >>>>> so the attached dmesgs are missing early logs. >>>>> >>>>>>>> Kernel must be configured: >>>>>>>> >>>>>>>> CONFIG_DYNAMIC_DEBUG=y >>>>>>>> >>>>>>>>> >>>>>>>>> * Would love to have a decent reference documentation to the SD Controller >>>>>>>>> especially a detailed description of the behavior of the register map in >>>>>>>>> mmc/host/cqhci.h All intel documents I could find seem to describe >>>>>>>>> similar but not quite identical register maps >>>>>>>> >>>>>>>> The JEDEC eMMC specification describes the CQHCI registers. >>>>>>>> It looks like you need to register to get access to them >>>>>>>> at www.jedec.org. <http://www.jedec.org. ;> AFAICT older specs, which would be just >>>>>>>> fine, are free to download after registering. >>>>>>>> >>>>> >>>>> Uhh -- but the price for registration is hefty. THX anyway >>>>> >>>>>>> THX >>>>>>>> You can get SD Host Controller Simplified Specification >>>>>>>> from www.sdcard.org <http://www.sdcard.org>. >>>>>>>> >>>>> >>>>> I attach to dmesg logs >>>>> * One with dynamic debugging and without sdhci.debug_quirks=0x40a0000 >>>>> * The other without dynamic debugging (as said I failed to crash the system) and sdhci.debug_quirks=0x40a0000 >>>>> >>>>> Hope I filtered properly for the relevant stuff >>>>> The second one contains what seems to me a ton of kernel Oops but >>>>> that might be due to the preliminary status of 6.2.0-rc4 >>>>> >>>>> THX >>>>> >>>>>>> THX >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> >>>>>>>>> Sebastian >>>>>>>>> >>>>>>>>> * dmesg >>>>>>>>> >>>>>>>>> [ 347.583082] mmc0: cqhci: timeout for tag 3, qcnt >>>>>>>> >>>>>>>> The timeout is pretty long, so it seems like the controller >>>>>>>> has gotten stuck. It won't reset the command or data >>>>>>>> circuits either, which is impossible if it is behaving >>>>>>>> correctly. >>>>>>>> >>>>>>>>> [ 347.583086] mmc0: cqhci: ============ CQHCI REGISTER DUMP =========== >>>>>>>>> [ 347.583114] mmc0: cqhci: Caps: 0x000030c0 | Version: 0x00000510 >>>>>>>>> [ 347.583117] mmc0: cqhci: Config: 0x00001101 | Control: 0x00000000 >>>>>>>>> [ 347.583120] mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000016 >>>>>>>>> [ 347.583123] mmc0: cqhci: Int sig: 0x00000016 | Int Coal: 0x00000000 >>>>>>>>> [ 347.583157] mmc0: cqhci: TDL base: 0x7c2b5000 | TDL up32: 0x00000001 >>>>>>>>> [ 347.583160] mmc0: cqhci: Doorbell: 0x0000000f | TCN: 0x00000000 >>>>>>>>> [ 347.583163] mmc0: cqhci: Dev queue: 0x00000008 | Dev Pend: 0x00000008 >>>>>>>>> [ 347.583166] mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00010008 >>>>>>>>> [ 347.583169] mmc0: cqhci: SSC2: 0x00000001 | DCMD rsp: 0x00000800 >>>>>>>>> [ 347.583172] mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000 >>>>>>>>> [ 347.583175] mmc0: cqhci: Resp idx: 0x0000002f | Resp arg: 0x00000900 >>>>>>>>> [ 347.583176] mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== >>>>>>>>> [ 347.583182] mmc0: sdhci: Sys addr: 0x00000080 | Version: 0x00001002 >>>>>>>>> [ 347.583185] mmc0: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000070 >>>>>>>>> [ 347.583188] mmc0: sdhci: Argument: 0x00030000 | Trn mode: 0x00000023 >>>>>>>>> [ 347.583191] mmc0: sdhci: Present: 0x1fff0106 | Host ctl: 0x0000003c >>>>>>>>> [ 347.583194] mmc0: sdhci: Power: 0x0000000b | Blk gap: 0x00000080 >>>>>>>>> [ 347.583197] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00000007 >>>>>>>>> [ 347.583200] mmc0: sdhci: Timeout: 0x0000000e | Int stat: 0x00000000 >>>>>>>>> [ 347.583203] mmc0: sdhci: Int enab: 0x02ff4000 | Sig enab: 0x02ff4000 >>>>>>>>> [ 347.583206] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000 >>>>>>>>> [ 347.583209] mmc0: sdhci: Caps: 0x546ec881 | Caps_1: 0x80000807 >>>>>>>>> [ 347.583212] mmc0: sdhci: Cmd: 0x00002f3a | Max curr: 0x00000000 >>>>>>>>> [ 347.583214] mmc0: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0x00000000 >>>>>>>>> [ 347.583217] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000 >>>>>>>>> [ 347.583219] mmc0: sdhci: Host ctl2: 0x0000000d >>>>>>>>> [ 347.583223] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x000000011eef1218 >>>>>>>>> [ 347.583224] mmc0: sdhci: ============================================ >>>>>>>>> [ 347.583229] mmc0: running CQE recovery >>>>>>>>> [ 347.690108] mmc0: Reset 0x2 never completed. >>>>>>>>> [ 347.690109] mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== >>>>>>>>> [ 347.690115] mmc0: sdhci: Sys addr: 0x00000080 | Version: 0x00001002 >>>>>>>>> [ 347.690118] mmc0: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000070 >>>>>>>>> [ 347.690121] mmc0: sdhci: Argument: 0x00030000 | Trn mode: 0x00000023 >>>>>>>>> [ 347.690124] mmc0: sdhci: Present: 0x1fff0106 | Host ctl: 0x0000003c >>>>>>>>> [ 347.690127] mmc0: sdhci: Power: 0x0000000b | Blk gap: 0x00000080 >>>>>>>>> [ 347.690130] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00000007 >>>>>>>>> [ 347.690133] mmc0: sdhci: Timeout: 0x0000000e | Int stat: 0x00000000 >>>>>>>>> [ 347.690136] mmc0: sdhci: Int enab: 0x00ff0003 | Sig enab: 0x00ff0003 >>>>>>>>> [ 347.690139] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000 >>>>>>>>> [ 347.690141] mmc0: sdhci: Caps: 0x546ec881 | Caps_1: 0x80000807 >>>>>>>>> [ 347.690144] mmc0: sdhci: Cmd: 0x00002f3a | Max curr: 0x00000000 >>>>>>>>> [ 347.690147] mmc0: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0x00000000 >>>>>>>>> [ 347.690150] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000 >>>>>>>>> [ 347.690152] mmc0: sdhci: Host ctl2: 0x0000000d >>>>>>>>> [ 347.690156] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x000000011eef1218 >>>>>>>>> [ 347.690157] mmc0: sdhci: ============================================ >>>>>>>>> [ 347.790190] mmc0: Reset 0x4 never completed. >>>>>>>>> [ 347.790191] mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== >>>>>>>>> [ 347.790193] mmc0: sdhci: Sys addr: 0x00000080 | Version: 0x00001002 >>>>>>>>> [ 347.790196] mmc0: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000070 >>>>>>>>> [ 347.790199] mmc0: sdhci: Argument: 0x00030000 | Trn mode: 0x00000023 >>>>>>>>> [ 347.790202] mmc0: sdhci: Present: 0x1fff0106 | Host ctl: 0x0000003c >>>>>>>>> [ 347.790205] mmc0: sdhci: Power: 0x0000000b | Blk gap: 0x00000080 >>>>>>>>> [ 347.790208] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00000007 >>>>>>>>> [ 347.790211] mmc0: sdhci: Timeout: 0x0000000e | Int stat: 0x00000000 >>>>>>>>> [ 347.790213] mmc0: sdhci: Int enab: 0x00ff0003 | Sig enab: 0x00ff0003 >>>>>>>>> [ 347.790216] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000 >>>>>>>>> [ 347.790219] mmc0: sdhci: Caps: 0x546ec881 | Caps_1: 0x80000807 >>>>>>>>> [ 347.790222] mmc0: sdhci: Cmd: 0x00002f3a | Max curr: 0x00000000 >>>>>>>>> [ 347.790225] mmc0: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0x00000000 >>>>>>>>> [ 347.790228] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000 >>>>>>>>> [ 347.790230] mmc0: sdhci: Host ctl2: 0x0000000d >>>>>>>>> [ 347.790234] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x000000011eef1218 >>>>>>>>> [ 347.790234] mmc0: sdhci: ============================================ >>>>>>>>> [ 347.803056] mmc0: Controller never released inhibit bit(s). >>>>>>>>> [ 347.803058] mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== >>>>>>>>> [ 347.803077] mmc0: sdhci: Sys addr: 0x00000080 | Version: 0x00001002 >>>>>>>>> [ 347.803080] mmc0: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000070 >>>>>>>>> [ 347.803083] mmc0: sdhci: Argument: 0x00030000 | Trn mode: 0x00000023 >>>>>>>>> [ 347.803086] mmc0: sdhci: Present: 0x1fff0106 | Host ctl: 0x0000003c >>>>>>>>> [ 347.803089] mmc0: sdhci: Power: 0x0000000b | Blk gap: 0x00000080 >>>>>>>>> [ 347.803092] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00000007 >>>>>>>>> [ 347.803095] mmc0: sdhci: Timeout: 0x0000000e | Int stat: 0x00000000 >>>>>>>>> [ 347.803098] mmc0: sdhci: Int enab: 0x00ff0003 | Sig enab: 0x00ff0003 >>>>>>>>> [ 347.803101] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000 >>>>>>>>> [ 347.803103] mmc0: sdhci: Caps: 0x546ec881 | Caps_1: 0x80000807 >>>>>>>>> [ 347.803106] mmc0: sdhci: Cmd: 0x00002f3a | Max curr: 0x00000000 >>>>>>>>> [ 347.803109] mmc0: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0x00000000 >>>>>>>>> [ 347.803112] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000 >>>>>>>>> [ 347.803114] mmc0: sdhci: Host ctl2: 0x0000000d >>>>>>>>> [ 347.803118] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x000000011eef1218 >>>>>>>>> [ 347.803119] mmc0: sdhci: ============================================ >>>>>>>>> -- >>>>>>>>> >>>>> -- >>>>> >>> >