On Mon, Nov 28, 2011 at 11:08 AM, Thomas Schauss <schauss@xxxxxx> wrote:
> On 11/16/2011 04:06 PM, Thomas Gleixner wrote:
>>
>> On Wed, 16 Nov 2011, Thomas Schauss wrote:
>>>
>>> Unfortunately, with 3.0-rt and the nvidia-driver we get complete system
>>> freezes when starting X on several different hardware setups (a few
>>> systems work fine). This is certainly caused by this combination. When
>>> using the nouveau-driver everything works fine.
>>
>> Have you ever tried to run with CONFIG_PROVE_LOCKING=y ?
>>
>
> Hello,
>
> thank you for that tip. I have tried this now and have not found any
> warnings which seem related to the nvidia-driver. Further testing revealed,
> that the driver works fine with CONFIG_PREEMPT_RTB and the freezes when
> running startx occur as soon as we switch to CONFIG_PREEMPT_RT_FULL.
>
> Regarding lockdep, we do get some warnings in slab.c -> cache_flusharray
> that however seem unrelated to nvidia. As we could not find any other bugs
> with the same locking warning I attached one example below. You can find
> some complete bootlogs (all with deadlock-warnings, all with slightly
> different call-stack) and my kernel-config at
>
> http://www.lsr.ei.tum.de/team/schauss/lockdep/
>
> On rt-base I also get a lockdep-warning which however seems unrelated to the
> rt-full one (not in cache_flusharray). You can find that log on the same
> page.
>
> Best Regards,
> Thomas
>
>
>
> Nov 17 17:34:49 fix kernel: [ 30.750925] =============================================
> Nov 17 17:34:49 fix kernel: [ 30.750927] [ INFO: possible recursive locking detected ]
> Nov 17 17:34:49 fix kernel: [ 30.750930] 3.0.9-25-rt #0
> Nov 17 17:34:49 fix kernel: [ 30.750931] ---------------------------------------------
> Nov 17 17:34:49 fix kernel: [ 30.750933] udevd/517 is trying to acquire lock:
> Nov 17 17:34:49 fix kernel: [ 30.750935] (&parent->list_lock){+.+...}, at: [<ffffffff81613e63>] cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.750944]
> Nov 17 17:34:49 fix kernel: [ 30.750945] but task is already holding lock:
> Nov 17 17:34:49 fix kernel: [ 30.750946] (&parent->list_lock){+.+...}, at: [<ffffffff81613e63>] cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.750950]
> Nov 17 17:34:49 fix kernel: [ 30.750951] other info that might help us debug this:
> Nov 17 17:34:49 fix kernel: [ 30.750952] Possible unsafe locking scenario:
> Nov 17 17:34:49 fix kernel: [ 30.750953]
> Nov 17 17:34:49 fix kernel: [ 30.750954] CPU0
> Nov 17 17:34:49 fix kernel: [ 30.750955] ----
> Nov 17 17:34:49 fix kernel: [ 30.750956] lock(&parent->list_lock);
> Nov 17 17:34:49 fix kernel: [ 30.750958] lock(&parent->list_lock);
> Nov 17 17:34:49 fix kernel: [ 30.750959]
> Nov 17 17:34:49 fix kernel: [ 30.750960] *** DEADLOCK ***
> Nov 17 17:34:49 fix kernel: [ 30.750961]
> Nov 17 17:34:49 fix kernel: [ 30.750962] May be due to missing lock nesting notation
> Nov 17 17:34:49 fix kernel: [ 30.750963]
> Nov 17 17:34:49 fix kernel: [ 30.750964] 2 locks held by udevd/517:
> Nov 17 17:34:49 fix kernel: [ 30.750966] #0: (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff8116a5c6>] kfree+0xd6/0x380
> Nov 17 17:34:49 fix kernel: [ 30.750973] #1: (&parent->list_lock){+.+...}, at: [<ffffffff81613e63>] cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.750977]
> Nov 17 17:34:49 fix kernel: [ 30.750977] stack backtrace:
> Nov 17 17:34:49 fix kernel: [ 30.750980] Pid: 517, comm: udevd Not tainted 3.0.9-25-rt #0
> Nov 17 17:34:49 fix kernel: [ 30.750982] Call Trace:
> Nov 17 17:34:49 fix kernel: [ 30.750987] [<ffffffff810a0097>] print_deadlock_bug+0xf7/0x100
> Nov 17 17:34:49 fix kernel: [ 30.750991] [<ffffffff810a1add>] validate_chain.isra.37+0x67d/0x720
> Nov 17 17:34:49 fix kernel: [ 30.750995] [<ffffffff810a2478>] __lock_acquire+0x478/0x9c0
> Nov 17 17:34:49 fix kernel: [ 30.750999] [<ffffffff8162ae19>] ? sub_preempt_count+0x29/0x60
> Nov 17 17:34:49 fix kernel: [ 30.751003] [<ffffffff81627475>] ? _raw_spin_unlock+0x35/0x60
> Nov 17 17:34:49 fix kernel: [ 30.751007] [<ffffffff81625f0b>] ? rt_spin_lock_slowlock+0x2eb/0x340
> Nov 17 17:34:49 fix kernel: [ 30.751011] [<ffffffff81056be1>] ? get_parent_ip+0x11/0x50
> Nov 17 17:34:49 fix kernel: [ 30.751014] [<ffffffff81613e63>] ? cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff810a2f64>] lock_acquire+0x94/0x160
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81613e63>] ? cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81626999>] rt_spin_lock+0x39/0x40
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81613e63>] ? cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff8105a90b>] ? migrate_disable+0x6b/0xe0
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81613e63>] cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81167a41>] kmem_cache_free+0x221/0x300
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81167b8f>] slab_destroy+0x6f/0xa0
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81167d32>] free_block+0x172/0x190
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81613eb4>] cache_flusharray+0x98/0xd6
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814f1110>] ? __sk_free+0x130/0x160
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814f1110>] ? __sk_free+0x130/0x160
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff8116a806>] kfree+0x316/0x380
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814f5328>] ? skb_queue_purge+0x28/0x40
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814f1110>] __sk_free+0x130/0x160
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814f11d5>] sk_free+0x25/0x30
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff8152d908>] netlink_release+0x128/0x200
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814ea388>] sock_release+0x28/0x90
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff814eaa57>] sock_close+0x17/0x30
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff8117b914>] __fput+0xb4/0x200
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff8117ba85>] fput+0x25/0x30
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81177d0c>] filp_close+0x6c/0x90
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff81177df0>] sys_close+0xc0/0x130
> Nov 17 17:34:49 fix kernel: [ 30.751015] [<ffffffff8162ed02>] system_call_fastpath+0x16/0x1b
>

Hmm, I think I see how this can happen.

cache_flusharray()
  spin_lock(&l3->list_lock);
  free_block(cachep, ac->entry, batchcount, node);
    slab_destroy()
      kmem_cache_free()
        __cache_free()
          cache_flusharray()
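
To make the re-entry into cache_flusharray() easier to follow, here is a
small user-space sketch. It is not kernel code: toy_lock, toy_cache,
toy_flusharray() and toy_simulate() are made-up names, and the checker only
compares lock class names, roughly the way the lockdep report above does. It
also assumes that the nested kmem_cache_free() targets a second cache, so the
inner list_lock is a different instance of the same class; the log alone does
not prove that.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_HELD 8

/* Toy lock: class_name stands in for a lockdep lock class. */
struct toy_lock {
    const char *class_name;
    const char *owner;          /* cache this instance belongs to */
};

/* Toy cache; mgmt_cache is a hypothetical second cache that receives
 * the nested free (NULL if there is none). */
struct toy_cache {
    struct toy_lock list_lock;
    struct toy_cache *mgmt_cache;
};

static const struct toy_lock *held[MAX_HELD];   /* locks held, in order */
static int held_count;
static int recursion_reports;

static void toy_lock_acquire(const struct toy_lock *l)
{
    /* Report when a lock of the same class is already held. */
    for (int i = 0; i < held_count; i++) {
        if (strcmp(held[i]->class_name, l->class_name) == 0) {
            recursion_reports++;
            printf("possible recursive locking: %s (%s) while holding %s (%s)\n",
                   l->class_name, l->owner,
                   held[i]->class_name, held[i]->owner);
        }
    }
    assert(held_count < MAX_HELD);
    held[held_count++] = l;
}

static void toy_lock_release(const struct toy_lock *l)
{
    /* This sketch only supports releasing in LIFO order. */
    assert(held_count > 0 && held[held_count - 1] == l);
    held_count--;
}

/* Stands for cache_flusharray(): take list_lock, then free a block.
 * The recursive call collapses free_block() -> slab_destroy() ->
 * kmem_cache_free() -> __cache_free() -> cache_flusharray(). */
static void toy_flusharray(struct toy_cache *c)
{
    toy_lock_acquire(&c->list_lock);
    if (c->mgmt_cache)
        toy_flusharray(c->mgmt_cache);
    toy_lock_release(&c->list_lock);
}

/* Run the scenario once and return the number of reports. */
int toy_simulate(void)
{
    struct toy_cache inner = { { "&parent->list_lock", "inner" }, NULL };
    struct toy_cache outer = { { "&parent->list_lock", "outer" }, &inner };

    held_count = 0;
    recursion_reports = 0;
    toy_flusharray(&outer);
    return recursion_reports;
}
```

Calling toy_simulate() from a main() yields one report, for the inner
list_lock taken while the outer one is held. Both locks are distinct objects,
but they share a class name, which is all this checker looks at. That would
fit the "May be due to missing lock nesting notation" hint in the log.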