Re: [PATCH] mtd: spi-nor: only apply reset hacks to broken hardware

NeilBrown <neilb@xxxxxxxx> · Wed, 01 Aug 2018 10:40:12 +1000

On Wed, Aug 01 2018, Marek Vasut wrote:

> On 07/31/2018 10:12 PM, Boris Brezillon wrote:
>> On Tue, 31 Jul 2018 11:05:11 +1000
>> NeilBrown <neilb@xxxxxxxx> wrote:
>> 
>>> On Fri, Jul 27 2018, Boris Brezillon wrote:
>>>
>>>> On Fri, 27 Jul 2018 11:33:13 -0700
>>>> Brian Norris <computersforpeace@xxxxxxxxx> wrote:
>>>>  
>>>>> Commit 59b356ffd0b0 ("mtd: m25p80: restore the status of SPI flash when
>>>>> exiting") is the latest from a long history of attempts to add reboot
>>>>> handling to handle stateful addressing modes on SPI flash. Some prior
>>>>> mostly-related discussions:
>>>>>
>>>>> http://lists.infradead.org/pipermail/linux-mtd/2013-March/046343.html
>>>>> [PATCH 1/3] mtd: m25p80: utilize dedicated 4-byte addressing commands
>>>>>
>>>>> http://lists.infradead.org/pipermail/barebox/2014-September/020682.html
>>>>> [RFC] MTD m25p80 3-byte addressing and boot problem
>>>>>
>>>>> http://lists.infradead.org/pipermail/linux-mtd/2015-February/057683.html
>>>>> [PATCH 2/2] m25p80: if supported put chip to deep power down if not used
>>>>>
>>>>> Previously, attempts to add reboot-time software reset handling were
>>>>> rejected, but the latest attempt was not.
>>>>>
>>>>> Quick summary of the problem:
>>>>> Some systems (e.g., boot ROM or bootloader) assume that they can read
>>>>> initial boot code from their SPI flash using 3-byte addressing. If the
>>>>> flash is left in 4-byte mode after reset, these systems won't boot. The
>>>>> above patch provided a shutdown/remove hook to attempt to reset the
>>>>> addressing mode before we reboot. Notably, this patch misses out on
>>>>> huge classes of unexpected reboots (e.g., crashes, watchdog resets).
>>>>>
>>>>> Unfortunately, it is essentially impossible to solve this problem 100%:
>>>>> if your system doesn't know how to reset the SPI flash to power-on
>>>>> defaults at initialization time, no amount of software can really rescue
>>>>> you -- there will always be a chance of some unexpected reset that
>>>>> leaves your flash in an addressing mode that your boot sequence didn't
>>>>> expect.
>>>>>
>>>>> While it is not directly harmful to perform hacks like the
>>>>> aforementioned commit on all 4-byte addressing flash, a
>>>>> properly-designed system should not need the hack -- and in fact,
>>>>> providing this hack may mask the fact that a given system is indeed
>>>>> broken. So this patch attempts to apply this unsound hack more narrowly,
>>>>> providing a strong suggestion to developers and system designers that
>>>>> this is truly a hack. With luck, system designers can catch their errors
>>>>> early on in their development cycle, rather than applying this hack long
>>>>> term. But apparently enough systems are out in the wild that we still
>>>>> have to provide this hack.
>>>>>
>>>>> Document a new device tree property to denote systems that do not have a
>>>>> proper hardware (or software) reset mechanism, and apply the hack (with
>>>>> a loud warning) only in this case.
>>>>>
>>>>> Signed-off-by: Brian Norris <computersforpeace@xxxxxxxxx>
>>>>> ---
>>>>> Note that I intentionally didn't split the documentation patch. It seems
>>>>> clearer to do these together IMO, but if it's *really* important to
>>>>> someone...I can resend  
>>>>
>>>> I'm fine with that.
>>>>
>>>> I'll leave Neil some time to review/test/comment on the patch before
>>>> queuing it, but it looks good to me.  
>>>
>>> Thanks.
>>> I can confirm that if I apply this patch, my system won't reboot
>>> properly (as expected), and if I then add
>>>
>>> 		broken-flash-reset;
>>>
>>> to the jedec,spi-nor device, it starts functioning correctly again.
>>>
>>> I don't like the pejorative "broken", and it also suggests that a thing
>>> used to work, but something happened to break it - this is not
>>> accurate.
>>> I would prefer something like "reset-not-connected" which is an accurate
>>> description of the state of the hardware.
>>>
>>> I also think that having a WARN_ON is an over-reaction.  Certainly a
>>> warning could be appropriate, but just one pr_warn() should be enough.
>>> The "problem" is unlikely in practice, and loudly warning people that an
>>> asteroid might kill them isn't particularly helpful.
>>>
>>> I genuinely think that if the system fails to reboot, then Linux is at
>>> fault. I accept that changing Linux to be completely robust might be
>>> more trouble than it is worth, but I don't accept that it is impossible.
>>>
>>> But I don't intend to fight either of these battles.
>> 
>> Does that mean you're accepting this change? Brian, any comment on what
>> Neil said?
>> 
>> To be honest, I hate being in the middle of this discussion without
>> having been involved in the first decision to accept such workarounds.
>> I keep thinking that making boards that do not have reset properly
>> wired less likely to fail rebooting is a wise decision, but I also
>> agree with Brian when he says we should inform people that their design
>> is unreliable.
>
> Hiding the issue in most cases only leads to vendors making more such
> crippled boards and never learning.

And you think that printing a loud warning would be likely to get vendors
to make fewer crappy boards?
I think it would just annoy people who aren't in a position to do
anything about it.

NeilBrown

>
>> The main problem I see here, is that adding this prop won't help people
>> figuring out what is wrong with their design, it will just help them
>> work around the problem when they find out, and it might already be too
>> late to fix the HW design. But maybe it's not what we're trying to do
>> here. Maybe we just want to warn users that rebooting such boards is a
>> risky procedure.
>
> The thing is, this is not a workaround, it's just a way of hiding the
> problem because the problem does not go away completely. There are still
> scenarios in which the system will fail.
>
> -- 
> Best regards,
> Marek Vasut