Fwd: __i915_gem_shrink / mm_find_pmd hogging CPU, then out of memory

Sam Jansen <sam.jansen@xxxxxxxxxxxx> · Mon, 2 Jun 2014 14:18:14 +0100

Hello intel-gfx,

I'm working on an application using VA-API for H264 encode+decode, and JPEG decode on an Atom E3815. Unfortunately we've hit what I believe is a kernel bug, and the "perf top" output is pointing at i915 DRM code.

After some amount of time running my application, the system will become unresponsive (userspace applications get very little CPU, system CPU will go up to 80+%), and sometimes the system will appear out of memory for a period (the OOM killer is sometimes invoked), even though there is a lot of free memory on the system. I noticed this first on kernel 3.14.5, so I moved to "drm-intel-nightly", built on Friday (2014-05-30), from http://cgit.freedesktop.org/drm-intel. The results are the same.

Using "perf top", I have gathered the following traces showing high system CPU at the time when the system was encountering this problem:

Samples: 15M of event 'cycles', Event count (approx.): 26497228423       
+   38.67%  [kernel]               [k] __i915_gem_shrink                 
+   14.89%  [kernel]               [k] list_move_tail                    
+    5.97%  [kernel]               [k] i915_gem_shrinker_count           
+    3.62%  [kernel]               [k] i915_gem_obj_is_pinned            
+    3.56%  [kernel]               [k] clflushopt                        
+    3.02%  perf                   [.] rb_next                           
+    1.16%  perf                   [.] __sort_chain_graph_rel            
+    1.10%  [kernel]               [k] clear_page_c_e                    
+    0.94%  [kernel]               [k] drm_clflush_page                  
+    0.83%  perf                   [.] rb_first                          
+    0.79%  perf                   [.] dso__find_symbol                  
+    0.71%  [kernel]               [k] delay_tsc                         
+    0.44%  perf                   [.] append_chain_children             
+    0.34%  perf                   [.] rb_insert_color                   
+    0.31%  i965_drv_video.so      [.] i965_post_processing_internal     
+    0.28%  [kernel]               [k] mm_find_pmd                       
+    0.27%  [kernel]               [k] xhci_irq                          
+    0.26%  perf                   [.] hists__output_resort              
+    0.26%  i965_drv_video.so      [.] gen7_pp_avs_set_block_parameter   
+    0.24%  [kernel]               [k] __vlv_force_wake_get              
+    0.23%  libc-2.18.so           [.] _int_malloc                       
+    0.23%  [kernel]               [k] __sg_page_iter_next               
+    0.22%  [kernel]               [k] __wake_up_bit                     
+    0.22%  [kernel]               [k] do_raw_spin_lock                  
+    0.22%  [kernel]               [k] get_page_from_freelist            
+    0.21%  [kernel]               [k] __page_check_address              
+    0.20%  perf                   [.] sort__dso_cmp                     
+    0.19%  [kernel]               [k] __acct_update_integrals    

Maybe 5-10 seconds later, this had moved to look like the following:

Samples: 15M of event 'cycles', Event count (approx.): 11020159263       
+    7.31%  perf                   [.] rb_next                           
+    4.72%  [kernel]               [k] mm_find_pmd                       
+    3.12%  perf                   [.] __sort_chain_graph_rel            
+    2.86%  [kernel]               [k] __page_check_address              
+    2.67%  [kernel]               [k] super_cache_count                 
+    2.15%  perf                   [.] rb_first                          
+    2.12%  [kernel]               [k] vma_interval_tree_iter_next       
+    1.88%  [kernel]               [k] do_raw_spin_lock                  
+    1.56%  [kernel]               [k] clflushopt                        
+    1.48%  perf                   [.] dso__find_symbol                  
+    1.28%  perf                   [.] append_chain_children             
+    1.28%  [kernel]               [k] vma_interval_tree_subtree_search  
+    1.15%  [kernel]               [k] rmap_walk                         
+    1.10%  [kernel]               [k] shrink_slab_node                  
+    1.05%  [kernel]               [k] page_referenced_one               
+    0.88%  perf                   [.] rb_insert_color                   
+    0.87%  [kernel]               [k] shrink_slab                       
+    0.86%  [kernel]               [k] clear_page_c_e                    
+    0.85%  [kernel]               [k] grab_super_passive                
+    0.74%  [kernel]               [k] __vma_address                     
+    0.69%  [kernel]               [k] __wake_up_bit                     
+    0.69%  [kernel]               [k] mem_cgroup_page_lruvec            
+    0.68%  [kernel]               [k] get_page_from_freelist            
+    0.65%  [kernel]               [k] page_referenced                   
+    0.65%  [kernel]               [k] isolate_lru_pages.isra.49         
+    0.62%  [kernel]               [k] shrink_page_list                  
+    0.61%  perf                   [.] hists__output_resort              
+    0.61%  [kernel]               [k] list_lru_count_node               
+    0.59%  [kernel]               [k] xhci_irq                          
+    0.52%  [kernel]               [k] vma_last_pgoff                    
+    0.49%  [kernel]               [k] __mod_zone_page_state             
+    0.48%  libm-2.18.so           [.] __ceil_sse41                      
+    0.47%  [kernel]               [k] _raw_spin_lock                   

The system was still largely unresponsive at this time - e.g. "perf top", "top", or even the shell might take seconds to respond to user input. Eventually, "perf" was killed by the OOM killer. dmesg has has some interesting lines in it ("Purging GPU memory"). Please see the dmesg and vmstat output at the end of this email.

After running this a couple of times, I noticed the following trends in "free" and "vmstat" output: free memory starts at 430M, and cached at 828M. Eventually free calls to 64M, and cached rises to >1000M (total RAM is 2GB, as you will see in the OOM output in dmesg, only a small amount of that is in use by userspace apps). Around this point, vmstat shows "bi" (blocks received from block devices) goes up from 0 to a few thousand per second, and does system CPU%, from 16-17 to 70-80.

I would be happy to retest with a different kernel and/or config, or increase debugging if that helps.

Thanks,
Sam

vmstat:

procs -----------memory---------- ---swap-- -----io---- -system---   ------cpu--------
 r  b swpd free   buff  cache     si   so    bi    bo   in      cs   us  sy  id  wa st

 0  0  0  376696  12580  860280   0    0     0      0   4013   765   21  16  26  36  0
 1  1  0  375864  12580  860760   0    0     0      0   4050   772   21  16  28  34  0

 0  1  0  374908  12580  861304   0    0     0      0   3991   762   21  16  27  36  0
 0  0  0  374232  12584  861784   0    0     0      0   4039   776   21  15  28  36  0

 0  1  0  373292  12584  862368   0    0     0      0   4016   775   22  17  27  35  0
... snip ...
 0  0  0  65264   304   1030056   0    0     0      0   4134   794   24  16  26  34  0

 0  1  0  64440   312   1030536   0    0     0      0   4258   778   26  16  26  32  0
 4  0  0  68180   312   1025992   0    0     0      0   6185   639   24  42  13  21  0

 4  1  0  71172   64    1024016   0    0     1928   0   8010   590   11  78  0   11  0
 2  0  0  68768   60    1024792   0    0     3440   0   11010  787   18  80  0   2   0

 4  0  0  73456   56    1023640   0    0     3108   0   9140   807   18  78  0   5   0
 2  0  0  71704   64    1025916   0    0     2748   0   8752   674   20  79  0   1   0

 3  0  0  74720   64    1022316   0    0     3136   0   8530   756   16  74  0   10  0
 2  2  0  81592   64    1015584   0    0     4096   0   12065  945   7   84  0   9   0

 2  0  0  81896   56    1015332   0    0     4860   0   11592  1008  13  83  0   4   0
 0  1  0  72244   56    1025908   0    0     3360   0   9384   912   15  80  0   5   0

 0  1  0  73580   52    1025928   0    0     2400   0   8276   916   8   80  0   13  0
 0  2  0  72696   60    1026316   0    0     4992   0   10721  1172  6   84  0   10  0

 2  3  0  78844   64    1021024   0    0     5988   0   8707   920   6   84  0   10  0
 1  2  0  81328   60    1019996   0    0     5020   0   7967   1194  8   75  0   17  0

 3  0  0  73288   64    1028360   0    0     5980   0   7824   1560  17  59  0   24  0
 3  2  0  80476   60    1021044   0    0     4992   0   8309   1548  22  61  2   16  0

 0  1  0  78924   68    1026952   0    0     2792   0   7319   1472  30  45  2   23  0
 0  0  0  68908   632   1035644   0    0     1144   0   4421   853   27  17  22  34  0

 0  0  0  67596   640   1036700   0    0     72     28  4374   816   26  17  25  32  0
 1  1  0  66912   640   1037252   0    0     0      0   4293   761   27  16  25  33  0

 0  1  0  65956   640   1037760   0    0     0      0   4195   767   26  16  26  32  0
 1  1  0  65248   648   1038320   0    0     0      12  3590   799   18  15  27  39  0

 0  0  0  64564   648   1038784   0    0     0      0   2743   787   7   15  33  45  0
 0  0  0  63856   648   1039312   0    0     0      8   2841   819   9   16  33  43  0

 0  1  0  68588   60    1034984   0    0     2260   0   6159   1508  12  48  5   36  0
 0  2  0  78016   64    1024456   0    0     7468   0   7740   1715  24  50  0   26  0

 1  1  0  65312   56    1036944   0    0     11452  12  6577   1876  11  48  0   42  0
 0  3  0  73876   52    1027820   0    0     17108  0   6994   2136  8   49  0   43  0

 4  0  0  66688   52    1033736   0    0     15624  12  7419   2047  23  47  0   31  0
 0  2  0  73032   56    1026724   0    0     27480  0   6056   2449  15  32  0   53  0

 0  1  0  75000   52    1024268   0    0     22752  0   5869   2336  6   39  1   54  0
 0  2  0  71720   52    1025556   0    0     35008  0   4508   2233  2   25  0   73  0

 0  3  0  73704   52    1022796   0    0     34648  4   4107   1911  6   21  0   74  0
 1  2  0  70968   52    1023524   0    0     36548  4   5040   2103  15  21  0   64  0

 0  3  0  66632   68    1026176   0    0     41928  44  5139   2641  2   24  0   74  0
 0  2  0  67932   52    1022424   0    0     38676  4   4604   2292  3   26  0   71  0

 2  2  0  64516   52    1024708   0    0     44780  0   5345   2139  17  24  0   60  0
 1  4  0  64008   52    1023216   0    0     50920  0   5242   2535  8   26  0   65  0

 0  2  0  63256   52    1021840   0    0     35436  4   4448   2250  4   26  0   70  0
 0  1  0  60516   52    1022780   0    0     33648  0   4557   2347  4   26  0   70  0

 4  2  0  56156   52    1022756   0    0     46604  0   5921   2769  11  27  0   62  0
 2  1  0  345644  472   1045888   0    0     13064  16  1973   1627  6   21  8   64  0

 0  1  0  341956  1428  1048312   0    0     0      0   602    765   6   14  35  45  0
 0  1  0  341388  1428  1048856   0    0     0      0   600    766   6   14  36  44  0

dmesg:

[244227.556881] Purging GPU memory, 2 bytes freed, 8429568 bytes still pinned.

[244227.557018] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.557129] perf invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0

[244227.557135] perf cpuset=/ mems_allowed=0
[244227.557146] CPU: 0 PID: 9481 Comm: perf Tainted: G        W   E 3.15.0-rc7-sl-01023-g085391259 #4

[244227.557151] Hardware name: \xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xfffff

fff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xfff

fffff\xffffffff\xffffffff \xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\x

ffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff

\xffffffff\xffffffff/DE3815TYKH, BIOS TYBYT10H.86A.0019.2014.0327.1516 03/27/201
[244227.557155]  0000000000000000 ffff88007014d968 ffffffff81669271 ffff88005adae480

[244227.557163]  ffff88007014d9f0 ffffffff816665fd 0000000000015d40 0001456a0b81c281
[244227.557169]  ffff88007014d9d0 ffffffff813a4180 000000000080a000 01ffffff8114d2f8

[244227.557176] Call Trace:
[244227.557190]  [<ffffffff81669271>] dump_stack+0x45/0x56
[244227.557197]  [<ffffffff816665fd>] dump_header+0x7a/0x1dc

[244227.557206]  [<ffffffff813a4180>] ? i915_gem_shrinker_oom+0x15e/0x18d
[244227.557215]  [<ffffffff81141ad9>] oom_kill_process+0x6d/0x2f7

[244227.557221]  [<ffffffff811422da>] out_of_memory+0x418/0x431
[244227.557229]  [<ffffffff81146a25>] __alloc_pages_nodemask+0x646/0x7ad

[244227.557239]  [<ffffffff81172124>] alloc_pages_current+0xc6/0xe1
[244227.557247]  [<ffffffff8113ed0d>] __page_cache_alloc+0x85/0x90

[244227.557254]  [<ffffffff81140d13>] filemap_fault+0x239/0x3a8
[244227.557262]  [<ffffffff8115cfda>] __do_fault+0x3b/0x89

[244227.557269]  [<ffffffff8115eb43>] do_read_fault.isra.73+0x171/0x1f5
[244227.557276]  [<ffffffff8115ff93>] handle_mm_fault+0x3eb/0xd58

[244227.557284]  [<ffffffff81132493>] ? perf_event_aux_ctx+0x42/0x59
[244227.557293]  [<ffffffff8167241d>] __do_page_fault+0x2aa/0x39b

[244227.557302]  [<ffffffff8110e295>] ? acct_account_cputime+0x1c/0x1e
[244227.557309]  [<ffffffff810b9b17>] ? account_system_time+0x102/0x10b

[244227.557318]  [<ffffffff8103bf53>] ? native_sched_clock+0x35/0x37
[244227.557325]  [<ffffffff8110e295>] ? acct_account_cputime+0x1c/0x1e

[244227.557342]  [<ffffffff810b9a10>] ? account_user_time+0x46/0x4b
[244227.557349]  [<ffffffff810b9d9f>] ? vtime_account_user+0x46/0x4b

[244227.557356]  [<ffffffff81672549>] do_page_fault+0x3b/0x56
[244227.557364]  [<ffffffff8166f7a2>] page_fault+0x22/0x30

[244227.557368] Mem-Info:
[244227.557372] Node 0 DMA per-cpu:
[244227.557377] CPU    0: hi:    0, btch:   1 usd:   0

[244227.557381] Node 0 DMA32 per-cpu:
[244227.557385] CPU    0: hi:  186, btch:  31 usd:   0
[244227.557395] active_anon:130938 inactive_anon:256046 isolated_anon:0

[244227.557395]  active_file:13 inactive_file:72 isolated_file:0
[244227.557395]  unevictable:0 dirty:0 writeback:1 unstable:0

[244227.557395]  free:13132 slab_reclaimable:25378 slab_unreclaimable:48246
[244227.557395]  mapped:3895 shmem:256101 pagetables:993 bounce:0

[244227.557395]  free_cma:0
[244227.557406] Node 0 DMA free:7876kB min:368kB low:460kB high:552kB active_anon:6736kB inactive_anon:632kB active_file:4kB inactive_file:84kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15924kB managed:15840kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB shmem:632kB slab_reclaimable:96kB slab_unreclaimable:132kB kernel_stack:8kB pagetables:36kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:148 all_unreclaimable? yes

[244227.557420] lowmem_reserve[]: 0 1877 1877 1877
[244227.557427] Node 0 DMA32 free:44652kB min:44684kB low:55852kB high:67024kB active_anon:517016kB inactive_anon:1023552kB active_file:48kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1972064kB managed:1925928kB mlocked:0kB dirty:0kB writeback:4kB mapped:15380kB shmem:1023772kB slab_reclaimable:101416kB slab_unreclaimable:192852kB kernel_stack:704kB pagetables:3936kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:436 all_unreclaimable? yes

[244227.557440] lowmem_reserve[]: 0 0 0 0
[244227.557447] Node 0 DMA: 3*4kB (UE) 20*8kB (EM) 16*16kB (EM) 9*32kB (EM) 4*64kB (UEM) 6*128kB (UEM) 4*256kB (UEM) 4*512kB (UE) 1*1024kB (E) 1*2048kB (R) 0*4096kB = 7884kB

[244227.557476] Node 0 DMA32: 5553*4kB (UEM) 687*8kB (UEM) 349*16kB (UEM) 48*32kB (UEM) 106*64kB (UEM) 8*128kB (EM) 0*256kB 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB = 44684kB

[244227.557504] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[244227.557508] 256203 total pagecache pages

[244227.557511] 496997 pages RAM
[244227.557514] 0 pages HighMem/MovableOnly
[244227.557517] 11534 pages reserved

[244227.557520] 0 pages hwpoisoned
[244227.557523] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name

[244227.557538] [  429]     0   429    10231      202      23        0         -1000 udevd
[244227.557545] [ 1264]     0  1264     3705     1728      10        0             0 dhclient

[244227.557552] [ 1432]     0  1432    65700      197      30        0             0 rsyslogd
[244227.557558] [ 1477]     0  1477     5911       53      17        0             0 cron

[244227.557575] [ 1498]   102  1498     8329      147      21        0             0 ntpd
[244227.557582] [ 1513]     0  1513    13743      167      29        0         -1000 sshd

[244227.557589] [ 1544]     0  1544     5942      134      15        0             0 nginx
[244227.557595] [ 1547] 65534  1547     6042      236      15        0             0 nginx

[244227.557601] [ 1548] 65534  1548     6042      240      15        0             0 nginx
[244227.557607] [ 1550]     0  1550     3721       39      13        0             0 getty

[244227.557614] [ 1551]     0  1551    15263      111      35        0             0 login
[244227.557620] [ 1552]     0  1552     3721       40      12        0             0 getty

[244227.557627] [ 1553]     0  1553     3721       41      12        0             0 getty
[244227.557633] [ 1554]     0  1554     3721       39      12        0             0 getty

[244227.557639] [ 1555]     0  1555     3721       41      13        0             0 getty
[244227.557646] [ 1564]     0  1564     4459       53      14        0             0 bash

[244227.557652] [ 1567]     0  1567     1078       29       7        0             0 startx
[244227.557658] [ 1584]     0  1584     3972       37      13        0             0 xinit

[244227.557664] [ 1585]     0  1585    36171     1275      69        0             0 Xorg
[244227.557670] [ 1590]     0  1590     1078       25       7        0             0 sh

[244227.557677] [ 1591]     0  1591     5541       67      17        0             0 dwm
[244227.557683] [ 1642]     0  1642    18136      245      41        0             0 xterm

[244227.557689] [ 1644]     0  1644     4527      131      15        0             0 bash
[244227.557696] [ 2756]     0  2756    20626      219      44        0             0 sshd

[244227.557714] [ 3812]     0  3812     4528      141      13        0             0 bash
[244227.557721] [ 9472]     0  9472   102030    78463     180        0             0 perf

[244227.557727] [24527]     0 24527    81704    49573     152        0             0 Main_(called_sl
[244227.557733] [24556]     0 24556    20626      233      43        0             0 sshd

[244227.557739] [24558]     0 24558     2375       76      10        0             0 bash
[244227.557746] [ 2351]     0  2351     2375       77      10        0             0 bash

[244227.557753] [ 2352]     0  2352     2120       33      10        0             0 vmstat
[244227.557759] [ 2353]     0  2353     1063       19       8        0             0 tail

[244227.557764] Out of memory: Kill process 9472 (perf) score 157 or sacrifice child
[244227.557769] Killed process 9472 (perf) total-vm:408120kB, anon-rss:313852kB, file-rss:0kB

[244227.558083] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.560227] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.560471] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.560666] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.560859] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561051] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.561227] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561419] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.561612] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561806] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.561983] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562223] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.562420] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562611] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.562804] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562979] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

[244227.563170] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx