Re: [PATCH v2] mm/page_isolation: fix a deadlock with printk()

Qian Cai <cai@xxxxxx> · Mon, 7 Oct 2019 07:04:00 -0400

On Oct 7, 2019, at 4:07 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:

I do not think that removing the printk is the right long term solution.
While I do agree that removing the debugging printk __offline_isolated_pages
does make sense because it is essentially of a very limited use, this
doesn't really solve the underlying problem.  There are likely other
printks from zone->lock. It would be much more saner to actually
disallow consoles to allocate any memory while printk is called from an
atomic context.

No, there are only a handful of places that call printk() with zone->lock held. Normally, callers quietly modify "struct zone" in a short section with zone->lock held.

No, it is not about "allocate any memory while printk is called from an
atomic context". It is about opposite lock chains taken on different processors, which has the same effect. For example,

CPU0:                 CPU1:                 CPU2:
console_owner
                      sclp_lock
sclp_lock                                   zone_lock
                      zone_lock
                                            console_owner

Each CPU holds one lock while waiting for a lock held by another CPU, so this is a deadlock.
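
To make the cycle explicit, here is a minimal sketch in Python. It is an illustration only and not how lockdep is implemented; the function name find_lock_cycle and the dictionary layout are hypothetical. It records each chain above as a "held -> wanted" edge and searches the resulting graph for a cycle.

```python
def find_lock_cycle(chains):
    """Return one cycle in the lock-order graph as a list of names, or None.

    `chains` maps a CPU label to (lock already held, lock being acquired).
    Hypothetical helper for illustration; lockdep itself works differently.
    """
    # Edge A -> B means "some CPU tried to take B while holding A".
    graph = {}
    for held, wanted in chains.values():
        graph.setdefault(held, set()).add(wanted)

    def dfs(node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt in path:
                # Reached a lock already on the current path: cycle found.
                return path[path.index(nxt):] + [nxt]
            found = dfs(nxt, path + [nxt])
            if found:
                return found
        return None

    for start in sorted(graph):
        cycle = dfs(start, [start])
        if cycle:
            return cycle
    return None


# The three lock chains from the diagram above.
chains = {
    "CPU0": ("console_owner", "sclp_lock"),
    "CPU1": ("sclp_lock", "zone_lock"),
    "CPU2": ("zone_lock", "console_owner"),
}

print(" -> ".join(find_lock_cycle(chains)))
# prints: console_owner -> sclp_lock -> zone_lock -> console_owner
```

Dropping any one of the three chains (for example, the printk() under zone->lock on CPU2) leaves no cycle, and the function returns None.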

The problem is probably there forever, but neither many developers will
run memory offline with the lockdep enabled nor admins in the field are
lucky enough yet to hit a perfect timing which required to trigger a
real deadlock. In addition, there aren't many places that call printk()
while zone->lock was held.

WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/1724 is trying to acquire lock:
0000000052059ec0 (console_owner){-...}, at: console_unlock+0x328/0xa30

but task is already holding lock:
000000006ffd89c8 (&(&zone->lock)->rlock){-.-.}, at: start_isolate_page_range+0x216/0x538

I am also wondering what does this lockdep report actually say. How come
we have a dependency between a start_kernel path and a syscall?

Petr explained it correctly.