Hi Kamal, Kamal Dasu <kdasu.kdev@xxxxxxxxx> wrote on Mon, 15 Jun 2020 11:11:00 -0400: > On Mon, Jun 15, 2020 at 3:19 AM Miquel Raynal <miquel.raynal@xxxxxxxxxxx> wrote: > > > > Hi Kamal, > > > > Kamal Dasu <kdasu.kdev@xxxxxxxxx> wrote on Fri, 12 Jun 2020 12:34:22 > > -0400: > > > > > On Fri, Jun 12, 2020 at 3:07 AM Miquel Raynal <miquel.raynal@xxxxxxxxxxx> wrote: > > > > > > > > Hi Kamal, > > > > > > > > Kamal Dasu <kdasu.kdev@xxxxxxxxx> wrote on Thu, 11 Jun 2020 12:04:29 > > > > -0400: > > > > > > > > > On Thu, Jun 11, 2020 at 3:27 AM Miquel Raynal <miquel.raynal@xxxxxxxxxxx> wrote: > > > > > > > > > > > > Hi Kamal, > > > > > > > > > > > > Kamal Dasu <kdasu.kdev@xxxxxxxxx> wrote on Thu, 11 Jun 2020 01:44:54 > > > > > > -0400: > > > > > > > > > > > > > Implemented ECC correctable and uncorrectable error handling for EDU > > > > > > > > > > > > Implement? > > > > > > > > > > > > > reads. If ECC correctable bitflips are encountered on EDU transfer, > > > > > > > > > > > > extra space ^ > > > > > > > > > > > > > read page again using pio, This is needed due to a controller lmitation > > > > > > > > > > > > s/pio/PIO/ > > > > > > > > > > > > > where read and corrected data is not transferred to the DMA buffer on ECC > > > > > > > errors. This holds true for ECC correctable errors beyond set threshold. > > > > > > > > > > > > error. > > > > > > > > > > > > Not sure what the last sentence means? > > > > > > > > > > > > > > > > NAND controller allows for setting a correctable ECC threshold number > > > > > of bits beyond which it will actually report the error to the driver. > > > > > e.g. for BCH-4 the threshold is 3, so 3-bit and 4-bit errors will > > > > > generate correctable ECC interrupt however 1-bit and 2-bit errors will > > > > > be corrected silently. > > > > > From the above example EDU hardware will not transfer corrected data > > > > > to the DMA buffer for 3-bit and 4-bit errors that get reported. So > > > > > once we detect > > > > > the error duing EDU we read the page again using pio. > > > > > > > > Ok I see what you mean, can't you fake the threshold instead? The NAND > > > > controller in Linux is not supposed to handle this threshold, the NAND > > > > core is in charge. So what the controller driver should do is just: > > > > increase the number of bitflips + return the maximum number or bitflip > > > > or increase the failure counter. Is this already the case? > > > > > > > /* threshold = ceil(BCH-level * 0.75) */ > > > brcmnand_wr_corr_thresh(host, DIV_ROUND_UP(chip->ecc.strength * 3, 4)); > > > This how the threshold is set, all it means is that for high BCH > > > levels don't interrupt on low number (less than threshold) of > > > bit_flips. Yes the controller driver only increments correctable ECC > > > count. But due the EDU design an EDU operation is disrupted when the > > > controller interrupts on correctable ECC errors during subpage ECC > > > calculations. Hence the driver needs to read the page again with PIO > > > to transfer corrected data. > > > > IIUC, you are doing the job twice: you should just return a number of > > bitflips or an error to the NAND core. So that's why I'm telling that > > you should get rid of this threshold. It would avoid the need for the > > PIO transfer too. > > I think you are reading some statements in isolation that probably are > causing some confusion. > EDU design has a flaw in case of reported ECC error interrupt in that > corrected data is not transferred to the DMA buffer. The PIO is needed > to read corrected data into the NAND data buffer and only for the > reported errors. So there is no need to change the threshold > calculation logic, if we get rid of the threshold then we will have to > do the PIO read on any correctable bit error if it occurs during EDU > reads. > > > > > You also say that the controller "only increments correctable ECC > > count", what do you mean exactly? > > Maybe that statement was a bit misleading. To be clear when an ECC > error is reported the controller gives the bit_flips count as well as > updates the ECC error address Register and ecc error status registers. > This logic works as expected in the hardware. > > >The controller does not report errors > > when the number of bitflips happens to be above the BCH threshold? This > > would be the only case where what is currently done would be actually > > needed though. > > It's the other way. The controller only reports bit errors beyond > >=threshold value, will not report otherwise and silently correct the > data. There is no problem in cases where erros are corrected > silently. Now ECC (un)correctable on EDU reads are detected by simply > reading back the ECC Error address register. And in case of reported > uncorrectable ECC errors are treated as usual. And for reported > correctable ECC errors we need to read the page again using PIO so > that the corrected data is properly transferred. All this applies to > EDU transfer only. > Thank you very much for the explanation, I understand better how this controller works now. Cheers, Miquèl ______________________________________________________ Linux MTD discussion mailing list http://lists.infradead.org/mailman/listinfo/linux-mtd/