On 06/28/2012 06:59 AM, Jim Schutt wrote:
> Hi,
>
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4. I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
>
> (FWIW I'm running my distro's version of perf; please let me know
> if I need to compile the tools/perf version to match my kernel.)
>
> The systems in question have 24 SAS drives spread across 3 HBAs,
> running 24 Ceph OSD instances, one per drive. FWIW these servers
> are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
> Ceph Linux clients doing dd simultaneously to a Ceph file system
> backed by 12 of these servers.
>
> In the early phase of such a test, when things are running well,
> here's what vmstat reports for the state of one of these servers:
>
> 2012-06-27 13:56:58.356-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
> 27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
> 28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
> 6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
> 22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0
> 30 12 0 230844 576 38461648 0 0 14 2298537 216580 166661 12 45 23 20 0
> 29 19 0 237856 576 38422884 0 0 5 2332741 209408 157138 12 42 25 22 0
> 17 11 0 222156 576 38483992 0 0 4 2380495 210312 173121 12 39 28 22 0
> 11 13 0 216152 576 38463872 0 0 44 2362186 215236 176454 12 39 27 22 0
> 12 14 0 223704 576 38546720 0 0 2 2395530 214684 177518 12 39 28 22 0
> 39 17 0 219932 576 38598184 0 0 4 2428231 223284 179095 12 42 24 21 0
> 11 10 0 219956 576 38521048 0 0 4 2323853 207472 166665 12 39 28 21 0
> 23 15 0 216580 576 38451904 0 0 3 2241800 201049 163496 11 37 31 21 0
> 9 13 0 225792 576 38451916 0 0 13 2281900 204869 171814 11 36 30 23 0
> 14 12 0 233820 576 38492728 0 0 4 2293423 207686 173019 11 37 31 21 0
> 11 20 0 213796 576 38533208 0 0 3 2288636 205605 168118 11 37 31 21 0
>
>
> The system begins to struggle over the next several
> minutes; here's what vmstat has to say:
>
> 2012-06-27 13:57:58.831-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 21 16 0 224628 576 38526872 0 0 2 1233 9 19 1 4 95 0 0
> 12 19 0 232060 576 38501020 0 0 4 2366769 221418 159890 12 48 20 19 0
> 124 8 0 218548 576 38379656 0 0 13 2103075 199660 108618 11 65 13 12 0
> 24 10 0 300476 576 38230288 0 0 31 1966282 177472 84572 10 76 7 7 0
> 20 16 0 217584 576 38296700 0 0 9 2062571 195936 128810 10 55 20 15 0
> 53 12 0 235720 576 38247968 0 0 30 2035407 196973 133921 10 52 23 15 0
> 20 16 0 360340 576 38067992 0 0 6 2192179 208692 136784 11 54 19 15 0
> 26 10 0 310800 576 38093884 0 0 43 2138772 207105 118718 11 64 12 13 0
> 24 15 0 261108 576 38030828 0 0 68 2174015 205793 135302 11 56 18 15 0
> 9 17 0 241816 576 37982072 0 0 20 2076145 194971 120285 10 60 16 14 0
> 37 15 0 255972 576 37892868 0 0 14 2225076 205694 126954 11 59 16 13 0
> 27 16 0 243212 576 37872704 0 0 6 2249476 210885 134684 12 60 15 14 0
> 30 10 0 217572 576 37795388 0 0 3 2128688 205027 118319 11 66 12 11 0
> 26 11 0 236420 576 37740164 0 0 23 2109709 205105 133925 10 56 19 14 0
> 45 15 0 330056 576 37619896 0 0 15 1948311 196188 119330 10 62 15 13 0
> 54 15 0 242696 576 37631500 0 0 4 2159530 202595 132588 11 59 16 15 0
>
> 2012-06-27 13:58:59.569-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 14 15 0 274932 576 37621548 0 0 2 1301 15 0 1 4 95 0 0
> 44 16 0 278748 576 37509516 0 0 6 2097643 196722 112697 10 66 12 11 0
> 88 14 0 228088 576 37412008 0 0 6 2089559 202206 116146 10 66 13 11 0
> 94 11 0 348348 576 37270624 0 0 7 1906390 181488 87333 9 76 9 6 0
> 42 11 0 215996 576 37288556 0 0 6 1782459 184375 95901 9 74 9 7 0
> 45 11 0 323112 576 37146500 0 0 12 1868376 187814 103947 9 71 11 9 0
> 51 7 0 244560 576 37100124 0 0 35 1767496 181383 107259 9 68 12 11 0
> 74 12 0 221584 576 37016420 0 0 18 1884986 183376 93425 9 75 9 7 0
> 45 10 0 275564 576 36985324 0 0 23 1683688 167223 97036 8 75 9 8 0
> 19 10 0 322176 576 36813176 0 0 14 1747378 177594 97218 8 72 12 8 0
> 122 7 0 225256 576 36838084 0 0 26 1730643 177915 92621 8 75 9 8 0
> 243 10 0 223464 576 36765460 0 0 18 1730158 173059 79373 8 80 6 5 0
> 100 10 0 307528 576 36598456 0 0 4 1738567 174077 79585 9 82 6 4 0
> 243 6 0 370064 576 36358576 0 0 9 1586528 174680 85353 8 81 6 5 0
> 267 2 0 322640 576 36254044 0 0 40 1011650 129389 42277 5 93 1 1 0
> 210 4 0 505092 576 35865460 0 0 25 720825 116356 32422 3 96 1 0 0
>
> 2012-06-27 14:00:03.219-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 75 1 0 566988 576 35664800 0 0 2 1355 21 3 1 4 95 0 0
> 433 1 0 964052 576 35069112 0 0 7 456359 102256 20901 2 98 0 0 0
> 547 3 0 820116 576 34893932 0 0 57 560507 114878 28115 3 96 0 0 0
> 806 2 0 606992 576 34848180 0 0 339 309668 101230 21056 2 98 0 0 0
> 708 1 0 529624 576 34708000 0 0 248 370886 101327 20062 2 97 0 0 0
> 231 5 0 504772 576 34663880 0 0 305 334824 95045 20407 2 97 1 1 0
> 158 6 0 1063088 576 33518536 0 0 531 847435 130696 47140 4 92 1 2 0
> 193 0 0 1449156 576 33035572 0 0 363 371279 94470 18955 2 96 1 1 0
> 266 6 0 1623512 576 32728164 0 0 77 241114 95730 15483 2 98 0 0 0
> 243 8 0 1629504 576 32653080 0 0 81 471018 100223 20920 3 96 0 1 0
> 70 11 0 1342140 576 33084020 0 0 100 925869 139876 56599 6 88 3 3 0
> 211 7 0 1130316 576 33470432 0 0 290 1008984 150699 74320 6 83 6 5 0
> 365 3 0 776736 576 34072772 0 0 182 747167 139436 67135 5 88 4 3 0
> 29 1 0 1528412 576 34110640 0 0 50 612181 137403 77609 4 87 6 3 0
> 266 5 0 1657688 576 34105696 0 0 3 258307 62879 38508 2 93 3 2 0
> 1159 2 0 2002256 576 33775476 0 0 19 88554 42112 14230 1 98 0 0 0
>
>
> Right around 14:00 I was able to get a "perf -a -g"; here's the
> beginning of what "perf report --sort symbol --call-graph fractal,5"
> had to say:
>
> #
> 64.86% [k] _raw_spin_lock_irqsave
> |
> |--97.94%-- isolate_freepages
> |           compaction_alloc
> |           unmap_and_move
> |           migrate_pages
> |           compact_zone
> |           |
> |           |--99.56%-- try_to_compact_pages
> |           |           __alloc_pages_direct_compact
> |           |           __alloc_pages_slowpath
> |           |           __alloc_pages_nodemask
> |           |           alloc_pages_vma
> |           |           do_huge_pmd_anonymous_page
> |           |           handle_mm_fault
> |           |           do_page_fault
> |           |           page_fault
> |           |           |
> |           |           |--53.53%-- skb_copy_datagram_iovec
> |           |           |           tcp_recvmsg
> |           |           |           inet_recvmsg
> |           |           |           sock_recvmsg
> |           |           |           sys_recvfrom
> |           |           |           system_call_fastpath
> |           |           |           __recv
> |           |           |           |
> |           |           |            --100.00%-- (nil)
> |           |           |
> |           |           |--27.80%-- __pthread_create_2_1
> |           |           |           (nil)
> |           |           |
> |           |            --18.67%-- memcpy
> |           |                       |
> |           |                       |--57.38%-- 0x50d000005
> |           |                       |
> |           |                       |--34.52%-- 0x3b300bf271940a35
> |           |                       |
> |           |                        --8.10%-- 0x1500000000000009
> |            --0.44%-- [...]
>  --2.06%-- [...]
>
> 6.15% [k] isolate_freepages_block
> |
> |--99.95%-- isolate_freepages
> |           compaction_alloc
> |           unmap_and_move
> |           migrate_pages
> |           compact_zone
> |           |
> |           |--99.54%-- try_to_compact_pages
> |           |           __alloc_pages_direct_compact
> |           |           __alloc_pages_slowpath
> |           |           __alloc_pages_nodemask
> |           |           alloc_pages_vma
> |           |           do_huge_pmd_anonymous_page
> |           |           handle_mm_fault
> |           |           do_page_fault
> |           |           page_fault
> |           |           |
> |           |           |--54.40%-- skb_copy_datagram_iovec
> |           |           |           tcp_recvmsg
> |           |           |           inet_recvmsg
> |           |           |           sock_recvmsg
> |           |           |           sys_recvfrom
> |           |           |           system_call_fastpath
> |           |           |           __recv
> |           |           |           |
> |           |           |            --100.00%-- (nil)
> |           |           |
> |           |           |--25.19%-- __pthread_create_2_1
> |           |           |           (nil)
> |           |           |
> |           |            --20.41%-- memcpy
> |           |                       |
> |           |                       |--40.24%-- 0x3b300bf271940a35
> |           |                       |
> |           |                       |--38.29%-- 0x1500000000000009
> |           |                       |
> |           |                        --21.47%-- 0x50d000005
> |            --0.46%-- [...]
>  --0.05%-- [...]
>
> 3.96% [.] ceph_crc32c_le
> |
> |--99.99%-- 0xb8057558d0065990
>  --0.01%-- [...]
>
> A different system in the same test had a slightly different
> call tree, but isolate_freepages() still seems to show up
> prominently:
>
> #
> 32.32% [k] _raw_spin_lock_irqsave
> |
> |--97.64%-- isolate_freepages
> |           compaction_alloc
> |           unmap_and_move
> |           migrate_pages
> |           compact_zone
> |           try_to_compact_pages
> |           __alloc_pages_direct_compact
> |           __alloc_pages_slowpath
> |           __alloc_pages_nodemask
> |           alloc_pages_vma
> |           do_huge_pmd_anonymous_page
> |           handle_mm_fault
> |           do_page_fault
> |           page_fault
> |           |
> |           |--65.31%-- skb_copy_datagram_iovec
> |           |           tcp_recvmsg
> |           |           inet_recvmsg
> |           |           sock_recvmsg
> |           |           sys_recvfrom
> |           |           system_call_fastpath
> |           |           __recv
> |           |           |
> |           |            --100.00%-- (nil)
> |           |
> |           |--30.98%-- memcpy
> |           |           |
> |           |           |--50.60%-- 0x50d0000
> |           |           |
> |           |            --49.40%-- 0x50d000005
> |            --3.70%-- [...]
>  --2.36%-- [...]
>
> 17.10% [k] _raw_spin_lock_irq
> |
> |--98.27%-- isolate_migratepages_range
> |           compact_zone
> |           try_to_compact_pages
> |           __alloc_pages_direct_compact
> |           __alloc_pages_slowpath
> |           __alloc_pages_nodemask
> |           alloc_pages_vma
> |           do_huge_pmd_anonymous_page
> |           handle_mm_fault
> |           do_page_fault
> |           page_fault
> |           |
> |           |--99.85%-- __pthread_create_2_1
> |           |           (nil)
> |            --0.15%-- [...]
>  --1.73%-- [...]
>
> 4.59% [k] mutex_spin_on_owner
> |
> --- __mutex_lock_slowpath
>     mutex_lock
>     |
>     |--50.14%-- page_lock_anon_vma
>     |           |
>     |           |--99.99%-- try_to_unmap_anon
>     |           |           try_to_unmap
>     |           |           __unmap_and_move
>     |           |           unmap_and_move
>     |           |           migrate_pages
>     |           |           compact_zone
>     |           |           try_to_compact_pages
>     |           |           __alloc_pages_direct_compact
>     |           |           __alloc_pages_slowpath
>     |           |           __alloc_pages_nodemask
>     |           |           alloc_pages_vma
>     |           |           do_huge_pmd_anonymous_page
>     |           |           handle_mm_fault
>     |           |           do_page_fault
>     |           |           page_fault
>     |           |           |
>     |           |           |--99.91%-- __pthread_create_2_1
>     |           |           |           (nil)
>     |           |            --0.09%-- [...]
>     |            --0.01%-- [...]
>     |
>     |--49.67%-- rmap_walk
>     |           move_to_new_page
>     |           __unmap_and_move
>     |           unmap_and_move
>     |           migrate_pages
>     |           compact_zone
>     |           try_to_compact_pages
>     |           __alloc_pages_direct_compact
>     |           __alloc_pages_slowpath
>     |           __alloc_pages_nodemask
>     |           alloc_pages_vma
>     |           do_huge_pmd_anonymous_page
>     |           handle_mm_fault
>     |           do_page_fault
>     |           page_fault
>     |           |
>     |           |--99.69%-- __pthread_create_2_1
>     |           |           (nil)
>     |            --0.31%-- [...]
>      --0.20%-- [...]
>
> 4.10% [k] isolate_freepages_block
> |
> |--99.95%-- isolate_freepages
> |           compaction_alloc
> |           unmap_and_move
> |           migrate_pages
> |           compact_zone
> |           try_to_compact_pages
> |           __alloc_pages_direct_compact
> |           __alloc_pages_slowpath
> |           __alloc_pages_nodemask
> |           alloc_pages_vma
> |           do_huge_pmd_anonymous_page
> |           handle_mm_fault
> |           do_page_fault
> |           page_fault
> |           |
> |           |--46.97%-- skb_copy_datagram_iovec
> |           |           tcp_recvmsg
> |           |           inet_recvmsg
> |           |           sock_recvmsg
> |           |           sys_recvfrom
> |           |           system_call_fastpath
> |           |           __recv
> |           |           |
> |           |            --100.00%-- (nil)
> |           |
> |           |--31.79%-- __pthread_create_2_1
> |           |           (nil)
> |           |
> |            --21.24%-- memcpy
> |                       |
> |                       |--61.90%-- 0x50d000005
> |                       |
> |                        --38.10%-- 0x50d0000
>  --0.05%-- [...]
>
> 3.65% [.] ceph_crc32c_le
> |
> |--99.86%-- 0xb8057558d0065990
>  --0.14%-- [...]
>
>
>
> I seem to be able to recreate this issue at will, so please
> let me know what I can do to help learn what is going on.
>
> Thanks -- Jim

I suspect compaction keeps trying to migrate pages continuously even
though we have no free memory.
Could you apply this patch and retest?

https://lkml.org/lkml/2012/6/21/30

Thanks.

--
Kind regards,
Minchan Kim
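
The vmstat samples in this report show system time (sy) climbing from roughly 40% to 98% while block output (bo) collapses. A minimal Python sketch, not part of the original thread, for flagging such samples in a saved log. It assumes each vmstat row has been joined onto a single line; the function names and the 90% threshold are hypothetical choices for illustration.

```python
# Column order of `vmstat -w`, taken from the header in the report.
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]


def parse_vmstat_row(line):
    """Return one vmstat data row as a dict of ints, or None for non-data lines."""
    # Drop any leading mail-quote markers ("> ") before splitting on whitespace.
    fields = line.lstrip("> ").split()
    if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
        return None
    return dict(zip(COLUMNS, map(int, fields)))


def busy_samples(lines, sy_threshold=90):
    """Return the rows whose system CPU share meets the (hypothetical) threshold."""
    rows = (parse_vmstat_row(line) for line in lines)
    return [row for row in rows if row is not None and row["sy"] >= sy_threshold]


sample = [
    "> r b swpd free buff cache si so bi bo in cs us sy id wa st",
    "> 27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0",
    "> 433 1 0 964052 576 35069112 0 0 7 456359 102256 20901 2 98 0 0 0",
]

for row in busy_samples(sample):
    # Show run-queue length, system CPU share, and blocks written out.
    print(row["r"], row["sy"], row["bo"])
```

Feeding it the four vmstat runs in order would show at which timestamp the run queue and system time start to diverge from the healthy baseline.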