RE: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change do_write_oneword() to use chip_good()

"Tokunori Ikegami" <ikegami_to@xxxxxxxxxxx> · Wed, 30 Jan 2019 02:15:15 +0900

Hi Jocke-san,

I have just confirmed the issue behavior again below as based on your advice.
So it seems that the issue behavior is not depended on the maximum suspend and resume cycles number mentioned by you.

1. On OpenWrt code the erasing state is not suspended since it is changed to sleep always for the erasing state in get_chip().
     Note: The issue behavior is caused by reverting the change to sleep for the erasing.
2. Additionally tried to change to not suspend for the writing state also but still the issue has been caused.
3. Added 1 us delay after the reset to retry to write one word but the issue is caused.
4. Added the unlock bypass reset to retry to write one word to make sure the issue behavior situation but the issue is caused.

It seems that the reset to retry to write one word causes the write error issue behavior that seems not able to be recovered.
Also the current checking for chip ready seems not enough so the reset is repeated about 1,000 times then the issue happened.

Now still the chip good checking seems correct to use until the time out to write one word.

> > > > I have lost track of all the details regarding this issue. I just
> want
> > to
> > > > add:
> > > >
> > > > There is a max number of suspend/resume cycles one can do during an
> > > > erase(possibly also for write)
> > > > and once that number is hit you get an error. One way to avoid this
> > could
> > > > be to
> > > > wait after each resume until the erase has started again(look in status
> > > > register)
> > > > before continuing.
> > > >
> > > > If this has anything to do with this problem, I do not know.

Regards,
Ikegami

> -----Original Message-----
> From: Tokunori Ikegami [mailto:ikegami_to@xxxxxxxxxxx]
> Sent: Wednesday, January 23, 2019 9:34 PM
> To: 'Joakim Tjernlund'; 'boris.brezillon@xxxxxxxxxxx'
> Cc: 'linux-mtd@xxxxxxxxxxxxxxxxxxx';
> 'chris.packham@xxxxxxxxxxxxxxxxxxx'; 'fbettoni@xxxxxxxxx';
> 'nbd@xxxxxxxx'; 'stable@xxxxxxxxxxxxxxx'; 'hauke@xxxxxxxxxx';
> 'koen.vandeputte@xxxxxxxxxxxx'; 'boris.brezillon@xxxxxxxxxxxxxxxxxx'
> Subject: RE: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change
> do_write_oneword() to use chip_good()
> 
> Thanks for your quick response.
> Noted it so I will try to confirm your advice.
> 
> Regards,
> Ikegami
> 
> > -----Original Message-----
> > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@xxxxxxxxxxxx]
> > Sent: Wednesday, January 23, 2019 9:13 PM
> > To: boris.brezillon@xxxxxxxxxxx; ikegami_to@xxxxxxxxxxx
> > Cc: linux-mtd@xxxxxxxxxxxxxxxxxxx; chris.packham@xxxxxxxxxxxxxxxxxxx;
> > fbettoni@xxxxxxxxx; nbd@xxxxxxxx; stable@xxxxxxxxxxxxxxx;
> > hauke@xxxxxxxxxx; koen.vandeputte@xxxxxxxxxxxx;
> > boris.brezillon@xxxxxxxxxxxxxxxxxx
> > Subject: Re: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change
> > do_write_oneword() to use chip_good()
> >
> > On Wed, 2019-01-23 at 20:56 +0900, Tokunori Ikegami wrote:
> > >
> > >
> > > Hi Jocke-san,
> > >
> > > Thanks for your advice.
> > >
> > > To make sure let me confirm below.
> > >
> > > The OpenWrt code includes your patch below.
> > >
> > >   f69cd2d30a80 2018-05-01 12:58:18 -0700 mtd: cfi: cmdset_0002: Do not
> > allow read/write to suspend erase block.  [Joakim Tjernlund]
> > >
> > > Do you mean that it is possible to be needed an additional change more
> > based on this?
> >
> > That patch resolves another problem. I have not sent a patch for problem
> > I mentioned in this mail.
> >
> > > Or is it not related to the patch fixed by you?
> > >   Note: Sorry now I am not able to check the patches to try sent by
> you
> > before.
> >
> > NP
> >    Jocke
> >
> > >
> > > > I have lost track of all the details regarding this issue. I just
> want
> > to
> > > > add:
> > > >
> > > > There is a max number of suspend/resume cycles one can do during an
> > > > erase(possibly also for write)
> > > > and once that number is hit you get an error. One way to avoid this
> > could
> > > > be to
> > > > wait after each resume until the erase has started again(look in status
> > > > register)
> > > > before continuing.
> > > >
> > > > If this has anything to do with this problem, I do not know.
> > > >
> > > >  Jocke
> > >
> > > Regards,
> > > Ikegami
> > >
> > > > -----Original Message-----
> > > > From: Joakim Tjernlund [mailto:Joakim.Tjernlund@xxxxxxxxxxxx]
> > > > Sent: Wednesday, January 23, 2019 1:58 AM
> > > > To: boris.brezillon@xxxxxxxxxxx; ikegami_to@xxxxxxxxxxx
> > > > Cc: linux-mtd@xxxxxxxxxxxxxxxxxxx;
> > chris.packham@xxxxxxxxxxxxxxxxxxx;
> > > > fbettoni@xxxxxxxxx; nbd@xxxxxxxx; stable@xxxxxxxxxxxxxxx;
> > > > hauke@xxxxxxxxxx; koen.vandeputte@xxxxxxxxxxxx;
> > > > boris.brezillon@xxxxxxxxxxxxxxxxxx
> > > > Subject: Re: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change
> > > > do_write_oneword() to use chip_good()
> > > >
> > > > On Wed, 2019-01-23 at 00:49 +0900, Tokunori Ikegami wrote:
> > > > >
> > > > > Hi Boris-san,
> > > > >
> > > > > Very sorry for too late to update about this.
> > > > > But could you please let me consult below about this patch?
> > > > >
> > > > > I have tried to investigate the issue root cause and confirmed below
> > but
> > > > > still the root cause seems not clear.
> > > > >
> > > > >   1. Without the change the write oneword is retried and recovered
> > by
> > > > the
> > > > > current existing chip_good() checking.
> > > > >      But after the 1,001 times recovery it was continued to fail
> > recovery
> > > > > with the 3 times retry.
> > > >
> > > > I have lost track of all the details regarding this issue. I just
> want
> > to
> > > > add:
> > > >
> > > > There is a max number of suspend/resume cycles one can do during an
> > > > erase(possibly also for write)
> > > > and once that number is hit you get an error. One way to avoid this
> > could
> > > > be to
> > > > wait after each resume until the erase has started again(look in status
> > > > register)
> > > > before continuing.
> > > >
> > > > If this has anything to do with this problem, I do not know.
> > > >
> > > >  Jocke
> > > >
> > > > >   2. By the patch change the recovery failure can be avoided and
> the
> > write
> > > > > oneword works correctly without any failure.
> > > > >      There are different from the original chip_good() checking
> as
> > the
> > > > > original code resets the chip before the retry.
> > > > >      The patch change wait the chip_good() status until the timeout
> > expiry
> > > > > without the chip reset.
> > > > >        Note: There is a different from the original OpenWrt patch
> > and
> > > > needed
> > > > > to be changed as same and it will be done by the next v4 patch.
> > > > >
> > > > >   3. To narrow down the cause I have added some delays into the
> original
> > > > > code to check the chip ready and good status.
> > > > >      But the failure behavior was not changed so it seems that the
> > issue
> > > > is
> > > > > not depended to the timing. (But not sure)
> > > > >
> > > > >   4. On the OpenWrt the write buffer is disabled but to narrow down
> > the
> > > > > issue I have changed to enable the write buffer.
> > > > >      Then the flash error was not happened by the write buffer
> > operation
> > > > so
> > > > > it seems that the flash driver works correctly.
> > > > >      But another issue was caused and it is similar issue with the
> > original
> > > > > OpenWrt behavior with the patch change.
> > > > >        Note: On the original OpenWrt needs to wait the file system
> > > > > completion to build but it was not finished with the write buffer.
> > (But
> > > > not
> > > > > sure about this behavior)
> > > > >
> > > > > Do you have any comment about this result?
> > > > >
> > > > > If you can agree about the patch change basically with the current
> > > > situation
> > > > > I will do send the v4 patch set later with fix for the comments.
> > > > >
> > > > > But it seems that it is difficult to investigate the root cause
> more
> > at
> > > > this
> > > > > moment to me.
> > > > > Since but the behavior looks depended on the flash chip hardware
> > behavior
> > > > > and I cannot debug the hardware behavior more I think.
> > > > >   Note: Now I can reproduce the flash error issue behavior on the
> > OpenWrt
> > > > > unit.
> > > > >
> > > > > > > >     It is depended on the actual flash chip behavior so the
> > root
> > > > cause
> > > > > > > is
> > > > > > > >     unknown.
> > > > > > >
> > > > > > > Yes, and that's what I'd like you to figure out, or at least
> have
> > > > a
> > > > > > > good idea why this doesn't work on some chips but works on others.
> > > > > >
> > > > > > I see.
> > > > > > But it is a little bit difficult situation since I do not have
> the
> > failure
> > > > > > environment locally at this moment.
> > > > > > But if needed I may ask to get the help for this to Fabio-san.
> > > > >
> > > > > Regards,
> > > > > Ikegami
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: IKEGAMI Tokunori [mailto:ikegami@xxxxxxxxxxxxxxxxxxxx]
> > > > > > Sent: Tuesday, November 6, 2018 6:47 PM
> > > > > > To: Boris Brezillon; ikegami_to@xxxxxxxxxxx
> > > > > > Cc: boris.brezillon@xxxxxxxxxxxxxxxxxx; Felix Fietkau; Hauke
> > Mehrtens;
> > > > > > stable@xxxxxxxxxxxxxxx; Joakim Tjernlund; PACKHAM Chris;
> > > > > > linux-mtd@xxxxxxxxxxxxxxxxxxx; Koen Vandeputte; Fabio Bettoni
> > > > > > Subject: RE: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change
> > > > > > do_write_oneword() to use chip_good()
> > > > > >
> > > > > > Sorry let me resend the mail below by changing the email address
> > of
> > > > > > Felix-san.
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: IKEGAMI Tokunori
> > > > > > Sent: Tuesday, November 06, 2018 6:37 PM
> > > > > > To: 'Boris Brezillon'; 'ikegami_to@xxxxxxxxxxx'
> > > > > > Cc: boris.brezillon@xxxxxxxxxxxxxxxxxx; Felix Fietkau; Hauke
> > Mehrtens;
> > > > > > stable@xxxxxxxxxxxxxxx; Joakim Tjernlund; PACKHAM Chris;
> > > > > > linux-mtd@xxxxxxxxxxxxxxxxxxx; Koen Vandeputte; Fabio Bettoni
> > > > > > Subject: RE: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change
> > > > > > do_write_oneword() to use chip_good()
> > > > > >
> > > > > > Hi Boris-san,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: stable-owner@xxxxxxxxxxxxxxx
> > > > > > > [mailto:stable-owner@xxxxxxxxxxxxxxx] On Behalf Of Boris
> Brezillon
> > > > > > > Sent: Tuesday, November 06, 2018 5:34 PM
> > > > > > > To: IKEGAMI Tokunori
> > > > > > > Cc: boris.brezillon@xxxxxxxxxxxxxxxxxx; Felix Fietkau; Hauke
> > > > Mehrtens;
> > > > > > > stable@xxxxxxxxxxxxxxx; Joakim Tjernlund; PACKHAM Chris;
> > > > > > > linux-mtd@xxxxxxxxxxxxxxxxxxx; Koen Vandeputte; Fabio Bettoni
> > > > > > > Subject: Re: [PATCH v3 01/11] mtd: cfi_cmdset_0002: Change
> > > > > > > do_write_oneword() to use chip_good()
> > > > > > >
> > > > > > > Hi IKEGAMI,
> > > > > > >
> > > > > > > On Tue, 6 Nov 2018 00:25:43 +0000
> > > > > > > IKEGAMI Tokunori <ikegami@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > > > Also the issue can be fixed by using chip_good() instead
> > of
> > > > > > > chip_ready().
> > > > > > > > > > The chip_ready() just checks the value from flash memory
> > twice.
> > > > > > > > > > And the chip_good() checks the value with the expected
> value.
> > > > > > > > > > Probably the issue can be fixed as checked correctly by
> > the
> > > > > > chip_good().
> > > > > > > > > > So change to use chip_good() instead of chip_ready().
> > > > > > > > >
> > > > > > > > > Well, that's not really explaining why you think chip_good()
> > should
> > > > > > > be
> > > > > > > > > used instead of chip_ready(). So I went on and looked at
> the
> > > > > > > > > chip_good(), chip_ready() and do_write_oneword()
> > implementation,
> > > > and
> > > > > > > > > also looked at users of do_write_oneword(). It seems this
> > function
> > > > > > is
> > > > > > > > > used to write data to the flash, and apparently the "one
> bit
> > should
> > > > > > > > > toggle to reflect a busy state" does not apply when writing
> > things
> > > > > > to
> > > > > > > > > the memory array (it's probably working for other CFI commands,
> > > > but
> > > > > > > I
> > > > > > > > > guess it takes more time to actually change the level of
> a
> > NOR
> > > > cell,
> > > > > > > > > hence the result of 2 identical reads does not mean that
> the
> > write
> > > > > > is
> > > > > > > > > done).
> > > > > > > > >
> > > > > > > > > Also, it seems that cmdset_0001 is not implementing
> chip_ready()
> > > > the
> > > > > > > > > same way, and I wonder if cmdset_0002 implementation is
> correct
> > > > to
> > > > > > > > > start with. Or maybe I don't get what chip_ready() is for.
> > > > > > > > >
> > > > > > > > > Anyway, this is the sort of clarification I'd like to have.
> > > > > > > >
> > > > > > > > I am thinking to update the commit message as below.
> > > > > > > >
> > > > > > > >     mtd: cfi_cmdset_0002: Use chip_good() to retry in
> > > > > > do_write_oneword()
> > > > > > > >     As reported by the OpenWRT team, write requests sometimes
> > fail
> > > > on
> > > > > > > some
> > > > > > > >     platforms.
> > > > > > > >     Currently to check the state chip_ready() is used correctly
> > > > as
> > > > > > > described by
> > > > > > > >     the flash memory S29GL256P11TFI01 datasheet.
> > > > > > >
> > > > > > > I had a look at the S29GL256P datasheet here [1], and if I'm
> correct,
> > > > > > > it's using cmdset 0001.
> > > > > >
> > > > > > No actually the cmdset 0002 is used on the flash chip.
> > > > > > The manufacturer ID xx01h and Device ID 2201h are used to decide.
> > > > > >
> > > > > > There is information from Fobis-san below also about this.
> > > > > >
> > > > > > On forum thread musashino posted picture of flash chip:
> > > > > >
> > > >
> >
> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fforum
> > >
> >
> > .openwrt.org%2Ft%2Fimpossible-to-install-update-any-packages-&amp;da
> > ta
> > > >
> >
> =02%7C01%7Cjoakim.tjernlund%40infinera.com%7C916af968b27a402da8cf08d68
> > > >
> >
> 0812fce%7C285643de5f5b4b03a1530ae2dc8aaf77%7C1%7C1%7C63683768968012655
> > > >
> >
> 7&amp;sdata=NNGSYgq1VTuofPPMMlyKIm9W1DJHQFw0s94Ernq5cts%3D&amp;reserve
> > > > d=0
> > > > > > on-wzr-hp-g300nh-18-06-1
> > > > > >
> > > >
> >
> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.cy
> > > >
> >
> press.com%2Fpart%2Fs29gl256p11tfi010&amp;data=02%7C01%7Cjoakim.tjernlu
> > > >
> >
> nd%40infinera.com%7C916af968b27a402da8cf08d680812fce%7C285643de5f5b4b0
> > > >
> >
> 3a1530ae2dc8aaf77%7C1%7C1%7C636837689680126557&amp;sdata=Twk1VUEESz14U
> > > > pdJjU4ohuhiQ5jN1uHLh0cAhlAznW0%3D&amp;reserved=0
> > > > > > [    0.862264] physmap platform flash device: 02000000 at 1e000000
> > > > > > [    0.868331] physmap-flash: Found 1 x16 devices at 0x0 in 16-bit
> > > > > > bank. Manufacturer ID 0x000001 Chip ID 0x002201
> > > > > > [    0.878493] Amd/Fujitsu Extended Query Table at 0x0040
> > > > > > [    0.883668]   Amd/Fujitsu Extended Query version 1.3.
> > > > > > [    0.888768] number of CFI chips: 1
> > > > > > [    0.894557] Searching for RedBoot partition table in
> > physmap-flash
> > > > > > at offset 0x1fc0000
> > > > > > [    0.918009] Searching for RedBoot partition table in
> > physmap-flash
> > > > > > at offset 0x1fe0000
> > > > > > [    0.941464] No RedBoot partition table detected in physmap-flash
> > > > > > [    0.947926] Creating 5 MTD partitions on "physmap-flash":
> > > > > > [    0.953384] 0x000000000000-0x000000040000 : "u-boot"
> > > > > > [    0.960853] 0x000000040000-0x000000060000 : "u-boot-env"
> > > > > > [    0.968803] 0x000000060000-0x000001fc0000 : "firmware"
> > > > > > [    0.981859] 2 uimage-fw partitions found on MTD device firmware
> > > > > > [    0.987900] 0x000000060000-0x0000001b5706 : "kernel"
> > > > > > [    0.994916] 0x0000001b5706-0x000001fc0000 : "rootfs"
> > > > > > [    1.001986] mtd: device 4 (rootfs) set to be root filesystem
> > > > > > [    1.007789] 1 squashfs-split partitions found on MTD device
> rootfs
> > > > > > [    1.014014] 0x0000003c0000-0x000001fc0000 : "rootfs_data"
> > > > > > [    1.022093] 0x000001fc0000-0x000001fe0000 : "user_property"
> > > > > > [    1.030283] 0x000001fe0000-0x000002000000 : "art"
> > > > > >
> > > > > > Maybe you could post links to forum thread, and data sheet.
> > > > > >
> > > > > > > >     Also chip_good() is used to check if the write is succeeded
> > > > and
> > > > > > it
> > > > > > > was
> > > > > > > >     implemented by the commit fb4a90bfcd6d8 ("[MTD] CFI-0002
> > -
> > > > Improve
> > > > > > > error
> > > > > > > >     checking").
> > > > > > > >     But actually the write failure is caused on some platforms
> > and
> > > > > also
> > > > > > > it can
> > > > > > > >     be fixed by using chip_good() to check the state and retry
> > > > > instead.
> > > > > > > Do you know on which NOR chips this happens? Do you have access
> > to
> > > > the
> > > > > > > datasheet?
> > > > > >
> > > > > > But it looks SST49LF008A [3] from the changes below but I am not
> > sure
> > > > at
> > > > > > this moment and probably it should be confirmed to the authr Eric
> > W.
> > > > > > Biedermann <ebiederman@xxxxxxxx> to make sure.
> > > > > >
> > > > > > +#define SST49LF008A            0x005a
> > > > > >
> > > > > >  static int cfi_amdstd_read (struct mtd_info *, loff_t, size_t,
> > size_t
> > > > *,
> > > > > > u_char *);
> > > > > >  static int cfi_amdstd_write_words(struct mtd_info *, loff_t,
> > size_t,
> > > > > > size_t *, const u_char *);
> > > > > > @@ -191,6 +192,7 @@ static struct cfi_fixup cfi_fixup_table[]
> =
> > {
> > > > > >  };
> > > > > >  static struct cfi_fixup jedec_fixup_table[] = {
> > > > > >         { MANUFACTURER_SST, SST49LF004B, fixup_use_fwh_lock,
> > NULL, },
> > > > > > +       { MANUFACTURER_SST, SST49LF008A, fixup_use_fwh_lock,
> > NULL, },
> > > > > >
> > > > > > > >     It is depended on the actual flash chip behavior so the
> > root
> > > > cause
> > > > > > > is
> > > > > > > >     unknown.
> > > > > > >
> > > > > > > Yes, and that's what I'd like you to figure out, or at least
> have
> > > > a
> > > > > > > good idea why this doesn't work on some chips but works on others.
> > > > > >
> > > > > > I see.
> > > > > > But it is a little bit difficult situation since I do not have
> the
> > failure
> > > > > > environment locally at this moment.
> > > > > > But if needed I may ask to get the help for this to Fabio-san.
> > > > > >
> > > > > > > > If any comment please let me know.
> > > > > > > >
> > > > > > > > > > Signed-off-by: Tokunori Ikegami
> > <ikegami@xxxxxxxxxxxxxxxxxxxx>
> > > > > > > > > > Signed-off-by: Hauke Mehrtens <hauke@xxxxxxxxxx>
> > > > > > > > > > Signed-off-by: Koen Vandeputte
> > <koen.vandeputte@xxxxxxxxxxxx>
> > > > > > > > > > Signed-off-by: Fabio Bettoni <fbettoni@xxxxxxxxx>
> > > > > > > > >
> > > > > > > > > Has the patch really gone through all those people? SoB
> is
> > used
> > > > when
> > > > > > > you
> > > > > > > > > apply a patch in your tree or when you're the original author.
> > > > > > > >
> > > > > > > > I have just checked the OpenWRT git log again and it looks
> that
> > > > it was
> > > > > > > originally
> > > > > > > > implemented by Felix Fietkau <nbd@xxxxxxxxxxx> by the patch
> > below
> > > > so
> > > > > > I
> > > > > > > will update the Signed-off-by tag as so.
> > > >
> >
> <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.
> > > >
> >
> openwrt.org%2F%3Fp%3Dopenwrt%2Fopenwrt.git%3Ba%3Dcommitdiff%3Bh%3D2530
> > > >
> >
> 640&amp;data=02%7C01%7Cjoakim.tjernlund%40infinera.com%7C916af968b27a4
> > > >
> >
> 02da8cf08d680812fce%7C285643de5f5b4b03a1530ae2dc8aaf77%7C1%7C1%7C63683
> > > >
> >
> 7689680126557&amp;sdata=w13ZTKwD1NiUQzxQfUou92KVDlW80qGUiZVIcjU%2BGPA%
> > > > 3D&amp;reserved=0
> > > > > > > f07cd2b3b14fe9ec03fa63a586452cc5f>
> > > > > > > > > > Co-Developed-by: Hauke Mehrtens <hauke@xxxxxxxxxx>
> > > > > > > > > > Co-Developed-by: Koen Vandeputte
> > <koen.vandeputte@xxxxxxxxxxxx>
> > > > > > > > > > Co-Developed-by: Fabio Bettoni <fbettoni@xxxxxxxxx>
> > > > > > > > >
> > > > > > > > > Not sure we want to add new undocumented tags, but you can
> > mention
> > > > > > > > > that all those people helped you find/debug the issue. They
> > can
> > > > also
> > > > > > > > > add their Reviewed-by/Tested-by if they like.
> > > > > > >
> > > > > > > My bad, I just noticed these are valid flags [2], so you can
> keep
> > > > them,
> > > > > > > and according to the doc, you should also keep the SoB.
> > > > > >
> > > > > > I see.
> > > > > > Yes I had also checked it.
> > > > > >
> > > > > > By the way in near future my company email address will be not
> able
> > > > to
> > > > > use.
> > > > > > So I will change the mail address to my personal email address
> [4]
> > after
> > > > > > that or before.
> > > > > >
> > > > > > Regards,
> > > > > > Ikegami
> > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Boris
> > > > > > >
> > > > > > >
> > >

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/