On Wednesday 12 April 2017 01:56 PM, Zhang Rui wrote: > On Wed, 2017-04-12 at 13:25 +0530, Keerthy wrote: >> >> On Wednesday 12 April 2017 09:35 AM, Eduardo Valentin wrote: >>> >>> Keerthy, >>> >>> On Wed, Apr 12, 2017 at 09:09:36AM +0530, Keerthy wrote: >>>> >>>> >>>> >>>> On Wednesday 12 April 2017 08:50 AM, Zhang Rui wrote: >>>>> >>>>> On Wed, 2017-04-12 at 08:19 +0530, Keerthy wrote: >>>>>> >>>>>> >>>>>> On Tuesday 11 April 2017 10:59 PM, Eduardo Valentin wrote: >>>>>>> >>>>>>> >>>>>>> Hey, >>>>>>> >>>>>>> On Fri, Mar 31, 2017 at 12:00:20PM +0530, Keerthy wrote: >>>>>>>> >>>>>>>> >>>>>>>> off). >>> <cut> >>> >>>> >>>>> >>>>>> >>>>>>> >>>>>>> OK... This seams to me, still a corner case supposed to be >>>>>>> fixed at >>>>>>> orderly_power_off, not at thermal. But.. >>>>>>> >>> ^^^ Then again, this must be fixed not at thermal core. And re- >>> reading >>> the history of this thread, this seams to be really something >>> broken at >>> OMAP/DRA7, as mentioned in previous messages. That is probably why >>> you >>> are pushing for pm_power_off(), which seams to be the one that >>> works for >>> you, pulling the plug correctly (DRA7). >> Zhang/Eduardo, >> >> OMAP5/DRA7 is one case. >> >> I believe i this is the root cause of this failure. >> >> thermal_zone_device_check --> thermal_zone_device_update --> >> handle_thermal_trip --> handle_critical_trips --> orderly_poweroff >> >> The above sequence happens every 250/500 mS based on the >> configuration. >> The orderly_poweroff function is getting called every 250/500 mS and >> i >> see with a full fledged nfs file system it takes at least 5-10 >> Seconds >> to shutdown and during that time we bombard with orderly_poweroff >> calls >> multiple times due to the thermal_zone_device_check triggering >> periodically. >> >> To confirm that i made sure that handle_critical_trips calls >> orderly_poweroff only once and i no longer see the failure on DRA72- >> EVM >> board. >> > Nice catch! Thanks. > >> So IMHO once we get to handle_critical_trips case where we do >> orderly_poweroff we need to do the following: >> >> 1) Make sure that orderly_poweroff is called only once. > > agreed. > >> 2) Cancel all the scheduled work queues to monitor the temperature as >> we have already reached a point of shutting down the system. >> > agreed. > > now I think we've found the root cause of the problem. > orderly_poweroff() is not reenterable and it does not have to be. > If we're using orderly_poweroff() for emergency power off, we have to > use it correctly. > > will you generate a patch to do this? Sure. I will generate a patch to take care of 1) To make sure that orderly_poweroff is called only once right away. I have already tested. for 2) Cancel all the scheduled work queues to monitor the temperature. I will take some more time to make it and test. Is that okay? Or you want me to send both together? Regards, Keerthy > > thanks, > rui > >> Let me know your thoughts on this. >> >> Best Regards, >> Keerthy >>> >>> >>>> >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> However, there is no clean way of detecting such failure >>>>>>>> of >>>>>>>> userspace >>>>>>>> powering off the system. In such scenarios, it is >>>>>>>> necessary for a >>>>>>>> backup >>>>>>>> workqueue to be able to force a shutdown of the system >>>>>>>> when >>>>>>>> orderly >>>>>>>> shutdown is not successful after a configurable time >>>>>>>> period. >>>>>>>> >>>>>>> Given that system running hot is a thermal issue, I guess >>>>>>> we care >>>>>>> more >>>>>>> on this matter then.. >>>>>> Yes! >>>>>> >>>>> I just read this thread again https://patchwork.kernel.org/patc >>>>> h/802458 >>>>> 1/ to recall the previous discussion. >>>>> >>>>> https://patchwork.kernel.org/patch/8149891/ >>>>> https://patchwork.kernel.org/patch/8149861/ >>>>> should be the solution made based on Ingo' suggestion, right? >>>>> >>>>> And to me, this sounds like the right direction to go, thermal >>>>> does not >>>>> need a back up shutdown solution, it just needs a kernel >>>>> function call >>>>> which guarantees the system can be shutdown/reboot immediately. >>>>> >>>>> is there any reason that patch 1/2 is not accepted? >>>> Zhang, >>>> >>>> http://www.serverphorums.com/read.php?12,1400964 >>>> >>>> I got a NAK from Alan and was given this direction on a >>>> thermal_poweroff >>>> which is more or less what is done in this patch. >>>> >>> >>> Actually, Alan's suggestion is more for you to define a >>> thermal_poweroff() that can be defined per architecture. >>> >>> Also, please, keep track of your patch versions and also do copy >>> everybody who has stated their opinion on previous discussions. >>> These >>> patches must have Ingo, Alan, and RMK copied too. In this way we >>> avoid >>> loosing track of what has been suggested and we also converge >>> faster to >>> something everybody (or most of us) agree. Next version, please, >>> fix >>> that. >>> >>> >>> To me, thermal core needs a function that simply powers off the >>> system. >>> No timeouts, delayed works, backups, etc. Simple and straight. >>> >>> The idea of having a per architecture implementation, as per Alan's >>> suggestion, makes sense to me too. Having something different from >>> pm_power_off(), specific to thermal, might also give the >>> opportunity to >>> save the power off reason. >>> >>> BR, >>> >>> Eduardo Valentin >>> -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html