Re: [PATCH] mtd: nand: raw: qcom_nandc: Don't clear_bam_transaction on READID

Konrad Dybcio <konrad.dybcio@xxxxxxxxxxxxxx> · Mon, 31 Jan 2022 20:54:12 +0100

On 31/01/2022 15:13, Sricharan Ramabadhran wrote:
Hi Konrad,

On 1/31/2022 3:39 PM, Konrad Dybcio wrote:

On 28/01/2022 18:50, Sricharan Ramabadhran wrote:
Hi Konrad,

On 1/28/2022 9:55 AM, Sricharan Ramabadhran wrote:
Hi Miquel,

On 1/26/2022 4:12 PM, Miquel Raynal wrote:
Hi Mani,

mani@xxxxxxxxxx wrote on Wed, 26 Jan 2022 16:03:16 +0530:

On Wed, Jan 26, 2022 at 11:16:13AM +0100, Miquel Raynal wrote:
Hello,

miquel.raynal@xxxxxxxxxxx wrote on Fri, 14 Jan 2022 08:27:18 +0100:
Hi Konrad,

konrad.dybcio@xxxxxxxxxxxxxx wrote on Thu, 13 Jan 2022 19:44:26 +0100:
While I have absolutely 0 idea why and how, running clear_bam_transaction
when READID is issued makes the DMA totally clog up and refuse to function
at all on mdm9607. In fact, it is so bad that all the data gets garbled
and after a short while in the nand probe flow, the CPU decides that
seppuku is the only option.

Removing _READID from the if condition makes it work like a charm, I can
read data and mount partitions without a problem.
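
For readers without the driver open, the change can be modelled roughly
as below. This is a standalone user-space sketch, not the driver code:
the two helper functions are hypothetical, and the set of other commands
in the condition (RESET, PARAM, ERASE1) is an assumption, since the
description above only says that _READID is removed from it. The opcode
values are the standard raw NAND ones.

```c
#include <stdbool.h>

/* Standard raw NAND opcodes. */
enum {
	NAND_CMD_ERASE1 = 0x60,
	NAND_CMD_READID = 0x90,
	NAND_CMD_PARAM  = 0xec,
	NAND_CMD_RESET  = 0xff,
};

/*
 * Hypothetical helper: which commands trigger clear_bam_transaction
 * before the patch. The non-READID members are an assumption.
 */
static bool clears_bam_before_patch(int command)
{
	return command == NAND_CMD_RESET || command == NAND_CMD_READID ||
	       command == NAND_CMD_PARAM || command == NAND_CMD_ERASE1;
}

/* Hypothetical helper: the same check with READID dropped. */
static bool clears_bam_after_patch(int command)
{
	return command == NAND_CMD_RESET ||
	       command == NAND_CMD_PARAM || command == NAND_CMD_ERASE1;
}
```

With this model, READID is the only command whose behaviour differs
between the two checks.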

Signed-off-by: Konrad Dybcio <konrad.dybcio@xxxxxxxxxxxxxx>
---
This is totally just an observation which took me an inhumane amount of
debug prints to find.. perhaps there's a better reason behind this, but
I can't seem to find any answers.. Therefore, this is a BIG RFC!
I'm adding two people from codeaurora who worked a lot on this driver.
Hopefully they will have an idea :)
Sadre, I've spent a significant amount of time reviewing your patches,
now it's your turn to not take a month to answer to your peers'
proposals.

Please help reviewing this patch.
Sorry. I was hoping that Qcom folks would chime in as I don't have any
idea about the mdm9607 platform. It could be that the mail server
migration from codeaurora to quicinc put a barrier here.

Let me ping them internally.
Oh, ok, I didn't know. Thanks!

Sorry Miquel, somehow we did not get this email in our inbox.
Thanks to Mani for pinging us, we will test this today and get back.

We could not reproduce this issue on our IPQ boards (we do not have an
mdm9607 right now), and the cause is not obvious.
Can you please share the debug logs you collected for the above, stage
by stage?

I won't have access to the board for about two weeks, sorry.

When I get to it, I'll surely try to send you the logs, though there
wasn't much more than just something jumping to who-knows-where after
clear_bam_transaction was called, resulting in values associated with
the NAND being all zeroed out in pr_err/_debug/etc.

Ok, sure. So was the READID command itself failing, or the subsequent
one? We can check which parameter reset by clear_bam_transaction is
causing the failure. Meanwhile, looping in Pradeep, who has access to
the board and so is in a better position to debug.

I'm sorry I have so few details on hand, and no kernel tree (no access
to that machine either, for now).

I will try to describe to the best of my abilities what I recall.

My methodology for making sure things didn't go haywire was to print the
oob size of our NAND basically every two lines of code (yes, I was very
desperate at one point), as that was zeroed out when *the bug* happened,
leading to a kernel bug/panic/stall (can't recall what exactly it was,
but it said something along the lines of "no support for oob size 0"
and then it didn't fail gracefully, leading to some bad jumps and
ultimately a dead platform..).

After hours of digging, I found out that everything goes fine until
clear_bam_transaction is called. After that gets executed, every NAND op
starts reading all zeroes (for example in the JEDEC ID check), so I
added the changes from this patch, and things magically started
working... My suspicion is that the underlying FIFO isn't fully drained
(is it a FIFO on 9607? bah, I work on too many SoCs at once) and this
function only makes Linux think it is, without actually draining it, and
the leftover commands get executed with some parts of them getting
overwritten, resulting in the famous garbage in - garbage out situation,
but that's only a guesstimate..
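
To make that guess concrete, here is a toy user-space model of the
failure mode I have in mind. Everything in it is hypothetical (struct
toy_fifo, the toy_* helpers, the depth of 8); it says nothing about how
the real BAM or clear_bam_transaction work. It only shows what happens
if software resets its own write position while the "hardware" still
holds pending descriptors.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 8

/* Toy stand-in for a hardware descriptor FIFO; not the real BAM. */
struct toy_fifo {
	int    desc[FIFO_DEPTH];
	size_t hw_head;  /* next slot the "hardware" will execute */
	size_t hw_tail;  /* one past the last slot the hardware has pending */
	size_t sw_pos;   /* where the driver thinks the next descriptor goes */
};

/* Driver queues a descriptor at its own idea of the write position. */
static void toy_queue(struct toy_fifo *f, int cmd)
{
	assert(f->sw_pos < FIFO_DEPTH);
	f->desc[f->sw_pos++] = cmd;
	if (f->sw_pos > f->hw_tail)
		f->hw_tail = f->sw_pos;
}

/* Software-only reset: the driver forgets its position, the hardware
 * keeps its pending range. */
static void toy_clear_sw_only(struct toy_fifo *f)
{
	f->sw_pos = 0;
}

/* Hardware executes everything it believes is pending, copying the
 * executed descriptors to out[] and returning how many there were. */
static size_t toy_hw_run(struct toy_fifo *f, int *out)
{
	size_t n = 0;

	while (f->hw_head < f->hw_tail)
		out[n++] = f->desc[f->hw_head++];
	return n;
}
```

Queue three descriptors, call toy_clear_sw_only(), queue one more, then
let the hardware run: it executes the new descriptor followed by two
stale ones, with the first stale one overwritten. That would be the
garbage-out part, if the real hardware behaves anything like this.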

Do note this somehow worked fine on 5.11 and then broke on 5.12/13. I
went as far as replacing most of the kernel with the updated/downgraded
parts via git checkout (I tried many combinations), to no avail.. I even
tried different compilers and optimization levels, thinking it could
have been a codegen issue, but no luck either.

I.. do understand this email is a total mess to read, as much as it was
to write, but without access to my code and the machine itself I can't
give you solid details, and the fact this situation is far from ordinary
doesn't help either..

The latest (ancient, not quite pretty, but probably working if my memory
is correct) version of my patches for the mdm9607 is available at [1]. I
will push the new revision after I get access to the workstation.

Konrad

[1] https://github.com/SoMainline/linux/commits/konrad/pinemodem

Regards,
   Sricharan