Re: 3.2-rc1 and nvidia drivers

John Kacur <jkacur@xxxxxxxxxx> · Mon, 28 Nov 2011 12:31:48 +0100

On Mon, Nov 28, 2011 at 11:08 AM, Thomas Schauss <schauss@xxxxxx> wrote:
> On 11/16/2011 04:06 PM, Thomas Gleixner wrote:
>>
>> On Wed, 16 Nov 2011, Thomas Schauss wrote:
>>>
>>> Unfortunately, with 3.0-rt and the nvidia-driver we get complete system
>>> freezes when starting X on several different hardware setups (a few
>>> systems work fine). This is certainly caused by this combination. When
>>> using the nouveau-driver everything works fine.
>>
>> Have you ever tried to run with CONFIG_PROVE_LOCKING=y ?
>>
>
> Hello,
>
> thank you for that tip. I have tried this now and have not found any
> warnings which seem related to the nvidia-driver. Further testing revealed
> that the driver works fine with CONFIG_PREEMPT_RTB and that the freezes
> when running startx occur as soon as we switch to CONFIG_PREEMPT_RT_FULL.
>
> Regarding lockdep, we do get some warnings in slab.c -> cache_flusharray
> that however seem unrelated to nvidia. As we could not find any other bugs
> with the same locking warning I attached one example below. You can find
> some complete bootlogs (all with deadlock-warnings, all with slightly
> different call-stack) and my kernel-config at
>
> http://www.lsr.ei.tum.de/team/schauss/lockdep/
>
> On rt-base I also get a lockdep-warning which however seems unrelated to the
> rt-full one (not in cache_flusharray). You can find that log on the same
> page.
>
> Best Regards,
> Thomas
>
>
>
> Nov 17 17:34:49 fix kernel: [   30.750925]
> =============================================
> Nov 17 17:34:49 fix kernel: [   30.750927] [ INFO: possible recursive
> locking detected ]
> Nov 17 17:34:49 fix kernel: [   30.750930] 3.0.9-25-rt #0
> Nov 17 17:34:49 fix kernel: [   30.750931]
> ---------------------------------------------
> Nov 17 17:34:49 fix kernel: [   30.750933] udevd/517 is trying to acquire
> lock:
> Nov 17 17:34:49 fix kernel: [   30.750935] (&parent->list_lock){+.+...}, at:
> [<ffffffff81613e63>] cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.750944]
> Nov 17 17:34:49 fix kernel: [   30.750945] but task is already holding lock:
> Nov 17 17:34:49 fix kernel: [   30.750946] (&parent->list_lock){+.+...}, at:
> [<ffffffff81613e63>] cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.750950]
> Nov 17 17:34:49 fix kernel: [   30.750951] other info that might help us
> debug this:
> Nov 17 17:34:49 fix kernel: [   30.750952]  Possible unsafe locking
> scenario:
> Nov 17 17:34:49 fix kernel: [   30.750953]
> Nov 17 17:34:49 fix kernel: [   30.750954]        CPU0
> Nov 17 17:34:49 fix kernel: [   30.750955]        ----
> Nov 17 17:34:49 fix kernel: [   30.750956]   lock(&parent->list_lock);
> Nov 17 17:34:49 fix kernel: [   30.750958]   lock(&parent->list_lock);
> Nov 17 17:34:49 fix kernel: [   30.750959]
> Nov 17 17:34:49 fix kernel: [   30.750960]  *** DEADLOCK ***
> Nov 17 17:34:49 fix kernel: [   30.750961]
> Nov 17 17:34:49 fix kernel: [   30.750962]  May be due to missing lock
> nesting notation
> Nov 17 17:34:49 fix kernel: [   30.750963]
> Nov 17 17:34:49 fix kernel: [   30.750964] 2 locks held by udevd/517:
> Nov 17 17:34:49 fix kernel: [   30.750966]  #0:  (&per_cpu(slab_lock,
> __cpu).lock){+.+...}, at: [<ffffffff8116a5c6>] kfree+0xd6/0x380
> Nov 17 17:34:49 fix kernel: [   30.750973]  #1:
> (&parent->list_lock){+.+...}, at: [<ffffffff81613e63>]
> cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.750977]
> Nov 17 17:34:49 fix kernel: [   30.750977] stack backtrace:
> Nov 17 17:34:49 fix kernel: [   30.750980] Pid: 517, comm: udevd Not tainted
> 3.0.9-25-rt #0
> Nov 17 17:34:49 fix kernel: [   30.750982] Call Trace:
> Nov 17 17:34:49 fix kernel: [   30.750987]  [<ffffffff810a0097>]
> print_deadlock_bug+0xf7/0x100
> Nov 17 17:34:49 fix kernel: [   30.750991]  [<ffffffff810a1add>]
> validate_chain.isra.37+0x67d/0x720
> Nov 17 17:34:49 fix kernel: [   30.750995]  [<ffffffff810a2478>]
> __lock_acquire+0x478/0x9c0
> Nov 17 17:34:49 fix kernel: [   30.750999]  [<ffffffff8162ae19>] ?
> sub_preempt_count+0x29/0x60
> Nov 17 17:34:49 fix kernel: [   30.751003]  [<ffffffff81627475>] ?
> _raw_spin_unlock+0x35/0x60
> Nov 17 17:34:49 fix kernel: [   30.751007]  [<ffffffff81625f0b>] ?
> rt_spin_lock_slowlock+0x2eb/0x340
> Nov 17 17:34:49 fix kernel: [   30.751011]  [<ffffffff81056be1>] ?
> get_parent_ip+0x11/0x50
> Nov 17 17:34:49 fix kernel: [   30.751014]  [<ffffffff81613e63>] ?
> cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff810a2f64>]
> lock_acquire+0x94/0x160
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81613e63>] ?
> cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81626999>]
> rt_spin_lock+0x39/0x40
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81613e63>] ?
> cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff8105a90b>] ?
> migrate_disable+0x6b/0xe0
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81613e63>]
> cache_flusharray+0x47/0xd6
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81167a41>]
> kmem_cache_free+0x221/0x300
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81167b8f>]
> slab_destroy+0x6f/0xa0
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81167d32>]
> free_block+0x172/0x190
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81613eb4>]
> cache_flusharray+0x98/0xd6
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814f1110>] ?
> __sk_free+0x130/0x160
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814f1110>] ?
> __sk_free+0x130/0x160
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff8116a806>]
> kfree+0x316/0x380
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814f5328>] ?
> skb_queue_purge+0x28/0x40
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814f1110>]
> __sk_free+0x130/0x160
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814f11d5>]
> sk_free+0x25/0x30
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff8152d908>]
> netlink_release+0x128/0x200
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814ea388>]
> sock_release+0x28/0x90
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff814eaa57>]
> sock_close+0x17/0x30
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff8117b914>]
> __fput+0xb4/0x200
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff8117ba85>]
> fput+0x25/0x30
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81177d0c>]
> filp_close+0x6c/0x90
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff81177df0>]
> sys_close+0xc0/0x130
> Nov 17 17:34:49 fix kernel: [   30.751015]  [<ffffffff8162ed02>]
> system_call_fastpath+0x16/0x1b
>

Hmm, I think I see how this can happen.

cache_flusharray()
        spin_lock(&l3->list_lock);
        free_block(cachep, ac->entry, batchcount, node);
                slab_destroy()
                        kmem_cache_free()
                                __cache_free()
                                        cache_flusharray()
                                                spin_lock(&l3->list_lock);  <- same lock class, taken again