Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Sat, 24 May 2014 23:22:09 +0900

Hello.

I tried to test whether it is OK (from point of view of reentrant) to use
mutex_lock() or mutex_lock_killable() inside shrinker functions when shrinker
functions do memory allocation, for drivers/gpu/drm/ttm/ttm_page_alloc_dma.c is
doing memory allocation with mutex lock held inside ttm_dma_pool_shrink_scan().

If I compile a test module shown below which mimics extreme case of what
ttm_dma_pool_shrink_scan() will do

---------- test.c start ----------
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/mm.h>

static DEFINE_MUTEX(lock);

static unsigned long shrink_test_count(struct shrinker *shrinker, struct shrink_control *sc)
{
        if (mutex_lock_killable(&lock)) {
                printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
                       "\n", current->pid, current->comm);
                return 0;
        }
        mutex_unlock(&lock);
        return 1;
}

static unsigned long shrink_test_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
        LIST_HEAD(list);
        int i = 0;
        if (mutex_lock_killable(&lock)) {
                printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
                       "\n", current->pid, current->comm);
                return 0;
        }
        while (1) {
                struct list_head *l = kmalloc(PAGE_SIZE, sc->gfp_mask);
                if (!l)
                        break;
                list_add_tail(l, &list);
                i++;
        }
        printk(KERN_WARNING "Process %u (%s) allocated %u pages\n",
               current->pid, current->comm, i);
        while (i--) {
                struct list_head *l = list.next;
                list_del(l);
                kfree(l);
        }
        mutex_unlock(&lock);
        return 1;
}

static struct shrinker recursive_shrinker = {
        .count_objects = shrink_test_count,
        .scan_objects = shrink_test_scan,
        .seeks = DEFAULT_SEEKS,
};

static int __init recursive_shrinker_init(void)
{
        register_shrinker(&recursive_shrinker);
        return 0;
}

static void recursive_shrinker_exit(void)
{
        unregister_shrinker(&recursive_shrinker);
}

module_init(recursive_shrinker_init);
module_exit(recursive_shrinker_exit);
MODULE_LICENSE("GPL");
---------- test.c end ----------

and load the test module and do

  # echo 3 > /proc/sys/vm/drop_caches

the system stalls with 0% CPU usage because of mutex deadlock
(with prior lockdep warning).

Is this because wrong gfp flags are passed to kmalloc() ? Is this because
the test module's shrinker functions return wrong values? Is this because
doing memory allocation with mutex held inside shrinker functions is
forbidden? Can anybody tell me what is wrong with my test module?

Regards.

[   48.077353] 
[   48.077999] =================================
[   48.080023] [ INFO: inconsistent lock state ]
[   48.080023] 3.15.0-rc6-00190-g1ee1cea #203 Tainted: G           OE
[   48.080023] ---------------------------------
[   48.080023] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[   48.086745] kswapd0/784 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   48.086745]  (lock#2){+.+.?.}, at: [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   48.086745] {RECLAIM_FS-ON-W} state was registered at:
[   48.086745]   [<c1089c18>] mark_held_locks+0x68/0x90
[   48.086745]   [<c1089cda>] lockdep_trace_alloc+0x9a/0xe0
[   48.086745]   [<c110b7f3>] kmem_cache_alloc+0x23/0x170
[   48.086745]   [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
[   48.086745]   [<c10e59be>] shrink_slab_node+0x13e/0x1d0
[   48.086745]   [<c10e6911>] shrink_slab+0x61/0xe0
[   48.086745]   [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
[   48.086745]   [<c117275a>] proc_sys_call_handler+0x6a/0xa0
[   48.086745]   [<c11727aa>] proc_sys_write+0x1a/0x20
[   48.086745]   [<c1110ac0>] vfs_write+0xa0/0x190
[   48.086745]   [<c1110ca6>] SyS_write+0x56/0xc0
[   48.086745]   [<c15201d6>] syscall_call+0x7/0xb
[   48.086745] irq event stamp: 39
[   48.086745] hardirqs last  enabled at (39): [<c10f3480>] count_shadow_nodes+0x20/0x40
[   48.086745] hardirqs last disabled at (38): [<c10f346c>] count_shadow_nodes+0xc/0x40
[   48.086745] softirqs last  enabled at (0): [<c1040627>] copy_process+0x2e7/0x1400
[   48.086745] softirqs last disabled at (0): [<  (null)>]   (null)
[   48.086745] 
[   48.086745] other info that might help us debug this:
[   48.086745]  Possible unsafe locking scenario:
[   48.086745] 
[   48.086745]        CPU0
[   48.086745]        ----
[   48.086745]   lock(lock#2);
[   48.086745]   <Interrupt>
[   48.086745]     lock(lock#2);
[   48.086745] 
[   48.086745]  *** DEADLOCK ***
[   48.086745] 
[   48.086745] 1 lock held by kswapd0/784:
[   48.086745]  #0:  (shrinker_rwsem){++++.+}, at: [<c10e68da>] shrink_slab+0x2a/0xe0
[   48.086745] 
[   48.086745] stack backtrace:
[   48.086745] CPU: 1 PID: 784 Comm: kswapd0 Tainted: G           OE 3.15.0-rc6-00190-g1ee1cea #203
[   48.086745] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 08/15/2008
[   48.086745]  c1ab9c20 dd187c94 c151a48f dd184250 dd187cd0 c1088f33 c165aa02 c165ac9d
[   48.086745]  00000310 00000000 00000000 00000000 00000000 00000001 00000001 c165ac9d
[   48.086745]  dd1847dc 0000000a 00000008 dd187cfc c1089ae1 00000008 000001b8 31a0987d
[   48.086745] Call Trace:
[   48.086745]  [<c151a48f>] dump_stack+0x48/0x61
[   48.086745]  [<c1088f33>] print_usage_bug+0x1f3/0x250
[   48.086745]  [<c1089ae1>] mark_lock+0x331/0x400
[   48.086745]  [<c1088f90>] ? print_usage_bug+0x250/0x250
[   48.086745]  [<c108a583>] __lock_acquire+0x283/0x1640
[   48.086745]  [<c108b9bb>] lock_acquire+0x7b/0xa0
[   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<c151c544>] mutex_lock_killable_nested+0x64/0x3e0
[   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<c11119f1>] ? put_super+0x21/0x30
[   48.086745]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
[   48.086745]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   48.086745]  [<c10e6911>] shrink_slab+0x61/0xe0
[   48.086745]  [<c10e8416>] kswapd+0x5f6/0x8e0
[   48.086745]  [<c1062e0f>] kthread+0xaf/0xd0
[   48.086745]  [<c10e7e20>] ? try_to_free_pages+0x540/0x540
[   48.086745]  [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
[   48.086745]  [<c1525d41>] ret_from_kernel_thread+0x21/0x30
[   48.086745]  [<c1062d60>] ? __init_kthread_worker+0x60/0x60

[   77.958388] SysRq : Show State
[   77.959377]   task                PC stack   pid father
[   77.960803] bash            D dfa6ae80  5068     1      0 0x00000000
[   77.962348]  ded35c30 00000046 dfa6ae90 dfa6ae80 322a9328 00000000 00000000 0000000b
[   77.962348]  ded32010 c191c8c0 ded34008 32319d5a 0000000b c191c8c0 32319d5a 0000000b
[   77.962348]  ded32010 00000001 ded35c04 3230ea9d 0000000b e0863060 ded325c4 ded32010
[   77.962348] Call Trace:
[   77.962348]  [<c1073ab7>] ? local_clock+0x17/0x30
[   77.962348]  [<c151b41e>] schedule+0x1e/0x60
[   77.962348]  [<c151b6df>] schedule_preempt_disabled+0xf/0x20
[   77.962348]  [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
[   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
[   77.962348]  [<c10e7b48>] try_to_free_pages+0x268/0x540
[   77.962348]  [<c10df529>] __alloc_pages_nodemask+0x3e9/0x720
[   77.962348]  [<c110bcbd>] cache_alloc_refill+0x37d/0x720
[   77.962348]  [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
[   77.962348]  [<c110b902>] kmem_cache_alloc+0x132/0x170
[   77.962348]  [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
[   77.962348]  [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
[   77.962348]  [<c151c4d8>] ? mutex_unlock+0x8/0x10
[   77.962348]  [<c10e59be>] shrink_slab_node+0x13e/0x1d0
[   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
[   77.962348]  [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
[   77.962348]  [<c151fe6d>] ? _raw_spin_unlock+0x1d/0x30
[   77.962348]  [<c117275a>] proc_sys_call_handler+0x6a/0xa0
[   77.962348]  [<c11727aa>] proc_sys_write+0x1a/0x20
[   77.962348]  [<c1110ac0>] vfs_write+0xa0/0x190
[   77.962348]  [<c1172790>] ? proc_sys_call_handler+0xa0/0xa0
[   77.962348]  [<c112d0fd>] ? __fdget+0xd/0x10
[   77.962348]  [<c1110ca6>] SyS_write+0x56/0xc0
[   77.962348]  [<c15201d6>] syscall_call+0x7/0xb

[   77.962348] kswapd0         D 00000246  6200   784      2 0x00000000
[   77.962348]  dd187d9c 00000046 c109d091 00000246 00000086 00000000 00000246 dd184250
[   77.962348]  dd184250 c191c8c0 dd186008 318e2084 0000000b c191c8c0 37e97ef1 0000000b
[   77.962348]  dd184250 dd184250 dd187d70 c10880aa dd187dac 00000000 0000007b ffffffff
[   77.962348] Call Trace:
[   77.962348]  [<c109d091>] ? rcu_irq_exit+0x71/0xc0
[   77.962348]  [<c10880aa>] ? print_lock_contention_bug+0x1a/0xf0
[   77.962348]  [<c151b41e>] schedule+0x1e/0x60
[   77.962348]  [<c151b6df>] schedule_preempt_disabled+0xf/0x20
[   77.962348]  [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
[   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
[   77.962348]  [<c10e8416>] kswapd+0x5f6/0x8e0
[   77.962348]  [<c1062e0f>] kthread+0xaf/0xd0
[   77.962348]  [<c10e7e20>] ? try_to_free_pages+0x540/0x540
[   77.962348]  [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
[   77.962348]  [<c1525d41>] ret_from_kernel_thread+0x21/0x30
[   77.962348]  [<c1062d60>] ? __init_kthread_worker+0x60/0x60

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> > Date: Mon, 19 May 2014 18:43:21 +0900
> > Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
> > 
> > I can observe that RHEL7 environment stalls with 100% CPU usage when a
> > certain type of memory pressure is given. While the shrinker functions
> > are called by shrink_slab() before the OOM killer is triggered, the stall
> > lasts for many minutes.
> > 
> > I added debug printk() and observed that many threads are blocked for more
> > than 10 seconds at ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
> > functions. Since the kswapd can call these functions later, the current
> > thread can return from these functions as soon as chosen by the OOM killer.
> > 
> > This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> > so that any threads can promptly give up. (By the way, as far as I tested,
> > changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> > of stall. Maybe we don't need to wait for mutex if someone is already calling
> > these functions.)
> > 
> 
> While discussing about XFS problem, I got a question. Is it OK (from point
> of view of reentrant) to use mutex_lock() or mutex_lock_killable() inside
> shrinker's entry point functions? Can senario shown below possible?
> 
> (1) kswapd is doing memory reclaim which does not need to hold mutex.
> 
> (2) Someone in GFP_KERNEL context (not kswapd) calls
>     ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
>     from direct reclaim path.
> 
> (3) Inside ttm_dma_pool_shrink_scan(), GFP_KERNEL allocation is issued
>     while mutex is held by the someone.
> 
> (4) GFP_KERNEL allocation cannot be completed immediately due to memory
>     pressure.
> 
> (5) kswapd calls ttm_dma_pool_shrink_count() which need to hold mutex.
> 
> (6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for
>     mutex held by the someone, and the someone is waiting for GFP_KERNEL
>     allocation to complete, but GFP_KERNEL allocation cannot be completed
>     until mutex held by the someone is released?
> 
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel