On Thu 2016-07-07 20:27:13, Topi Miettinen wrote: > On 07/07/16 09:16, Petr Mladek wrote: > > On Sun 2016-07-03 15:08:07, Topi Miettinen wrote: > >> The attached patch would make any uses of capabilities generate audit > >> messages. It works for simple tests as you can see from the commit > >> message, but unfortunately the call to audit_cgroup_list() deadlocks the > >> system when booting a full blown OS. There's no deadlock when the call > >> is removed. > >> > >> I guess that in some cases, cgroup_mutex and/or css_set_lock could be > >> already held earlier before entering audit_cgroup_list(). Holding the > >> locks is however required by task_cgroup_from_root(). Is there any way > >> to avoid this? For example, only print some kind of cgroup ID numbers > >> (are there unique and stable IDs, available without locks?) for those > >> cgroups where the task is registered in the audit message? > > > > I am not sure if anyone know what really happens here. I suggest to > > enable lockdep. It might detect possible deadlock even before it > > really happens, see Documentation/locking/lockdep-design.txt > > > > It can be enabled by > > > > CONFIG_PROVE_LOCKING=y > > > > It depends on > > > > CONFIG_DEBUG_KERNEL=y > > > > and maybe some more options, see lib/Kconfig.debug > > Thanks a lot! I caught this stack dump: > > starting version 230 > [ 3.416647] ------------[ cut here ]------------ > [ 3.417310] WARNING: CPU: 0 PID: 95 at > /home/topi/d/linux.git/kernel/locking/lockdep.c:2871 > lockdep_trace_alloc+0xb4/0xc0 > [ 3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)) > [ 3.417923] Modules linked in: > [ 3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97 > [ 3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS Debian-1.8.2-1 04/01/2014 > [ 3.418726] 0000000000000086 000000007970f3b0 ffff88000016fb00 > ffffffff813c9c45 > [ 3.418993] ffff88000016fb50 0000000000000000 ffff88000016fb40 > ffffffff81091e9b > [ 3.419176] 00000b3705e2c798 0000000000000046 0000000000000410 > 00000000ffffffff > [ 3.419374] Call Trace: > [ 3.419511] [<ffffffff813c9c45>] dump_stack+0x67/0x92 > [ 3.419644] [<ffffffff81091e9b>] __warn+0xcb/0xf0 > [ 3.419745] [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80 > [ 3.419868] [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0 > [ 3.419988] [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600 > [ 3.420156] [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 > [ 3.420170] [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0 > [ 3.420170] [<ffffffff81144f6b>] audit_log_start+0x29b/0x480 > [ 3.420170] [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270 > [ 3.420170] [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0 > [ 3.420170] [<ffffffff8109cd75>] ns_capable+0x45/0x70 > [ 3.420170] [<ffffffff8109cdb7>] capable+0x17/0x20 > [ 3.420170] [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0 > [ 3.420170] [<ffffffff81230997>] __vfs_write+0x37/0x160 > [ 3.420170] [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30 > [ 3.420170] [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90 > [ 3.420170] [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0 > [ 3.420170] [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0 > [ 3.420170] [<ffffffff81231048>] vfs_write+0xb8/0x1b0 > [ 3.420170] [<ffffffff812533c6>] ? __fget_light+0x66/0x90 > [ 3.420170] [<ffffffff81232078>] SyS_write+0x58/0xc0 > [ 3.420170] [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300 > [ 3.420170] [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25 > [ 3.420170] ---[ end trace fb586899fb556a5e ]--- > [ 3.447922] random: systemd-udevd urandom read with 3 bits of entropy > available > [ 4.014078] clocksource: Switched to clocksource tsc > Begin: Loading essential drivers ... done. > > This is with qemu and the boot continues normally. With real computer, > there's no such output and system just seems to freeze. > > Could it be possible that the deadlock happens because there's some IO > towards /sys/fs/cgroup, which causes a capability check and that in turn > causes locking problems when we try to print cgroup list? The above warning is printed by the code from kernel/locking/lockdep.c:2871 static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) { [...] /* We're only interested __GFP_FS allocations for now */ if (!(gfp_mask & __GFP_FS)) return; /* * Oi! Can't be having __GFP_FS allocations with IRQs disabled. */ if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) return; The backtrace shows that your new audit_log_cap_use() is called from vfs_write(). You might try to use audit_log_start() with GFP_NOFS instead of GFP_KERNEL. Note that this is rather intuitive advice. I still need to learn a lot about memory management and kernel in general to be more sure about a correct solution. Best Regards, Petr -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html