On Mon 2019-10-07 09:07:02, Qian Cai wrote: > On Mon, 2019-10-07 at 14:43 +0200, Michal Hocko wrote: > > On Mon 07-10-19 08:11:44, Qian Cai wrote: > > > On Mon, 2019-10-07 at 13:37 +0200, Michal Hocko wrote: > > > > On Mon 07-10-19 07:04:00, Qian Cai wrote: > > > > > > > > > > > > > > > > On Oct 7, 2019, at 4:07 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote: > > > > > > > > > > > > I do not think that removing the printk is the right long term solution. > > > > > > While I do agree that removing the debugging printk __offline_isolated_pages > > > > > > does make sense because it is essentially of a very limited use, this > > > > > > doesn't really solve the underlying problem. There are likely other > > > > > > printks from zone->lock. It would be much more saner to actually > > > > > > disallow consoles to allocate any memory while printk is called from an > > > > > > atomic context. > > > > > > > > > > No, there is only a handful of places called printk() from > > > > > zone->lock. It is normal that the callers will quietly process > > > > > “struct zone” modification in a short section with zone->lock > > > > > held. > > > > > > > > It is extremely error prone to have any zone->lock vs. printk > > > > dependency. I do not want to play an endless whack a mole. > > > > > > > > > No, it is not about “allocate any memory while printk is called from an > > > > > atomic context”. It is opposite lock chain from different processors which has the same effect. For example, > > > > > > > > > > CPU0: CPU1: CPU2: > > > > > console_owner > > > > > sclp_lock > > > > > sclp_lock zone_lock > > > > > zone_lock > > > > > console_owner > > > > > > > > Why would sclp_lock ever take a zone->lock (apart from an allocation). > > > > So really if sclp_lock is a lock that might be taken from many contexts > > > > and generate very subtle lock dependencies then it should better be > > > > really careful what it is calling into. > > > > > > > > In other words you are trying to fix a wrong end of the problem. Fix the > > > > console to not allocate or depend on MM by other means. > > > > > > It looks there are way too many places that could generate those indirect lock > > > chains that are hard to eliminate them all. Here is anther example, where it > > > has, > > > > Yeah and I strongly suspect they are consoles which are broken and need > > to be fixed rathert than the problem papered over. > > > > I do realize how tempting it is to remove all printks from the > > zone->lock but do realize that as soon as the allocator starts using any > > other locks then we are back to square one and the problem is there > > again. We would have to drop _all_ printks from any locked section in > > the allocator and I do not think this is viable. > > > > Really, the only way forward is to make these consoles be more careful > > of external dependencies. > > Even with the new printk() Petr proposed. There is no guarantee it will fix it > properly. It looks like just reduce the chance of this kind of deadlocks as it > may or may not call wake_up_klogd() in vprintk_emit() depends on timing. The chain below is wrong: > zone->lock > printk_deferred() > vprintk_emit() > wake_up_klogd() wake_up_klogd() calls irq_work_queue(). It queues the work for an interrupt handler and triggers the interrupt. > wake_up_klogd_work_func() > console_unlock() The work is done in the interrupt context. The interrupt could never be handled under zone->lock. So, printk_deferred() would help. But I do not think that it is really needed. I am going to answer the original mail with all the full lockdep report. Best Regards, Petr