Hi, I'm running into trouble with systems going unresponsive, and perf suggests it's excessive CPU usage by isolate_freepages(). I'm currently testing 3.5-rc4, but I think this problem may have first shown up in 3.4. I'm only just learning how to use perf, so I only currently have results to report for 3.5-rc4. (FWIW I'm running the my distro version of perf; please let me know if I need to compile the tools/perf version to match my kernel.) The systems in question have 24 SAS drives spread across 3 HBAs, running 24 Ceph OSD instances, one per drive. FWIW these servers are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160 Ceph Linux clients doing dd simultaneously to a Ceph file system backed by 12 of these servers. In the early phase of such a test, when things are running well, here's what vmstat reports for the state of one of these servers: 2012-06-27 13:56:58.356-06:00 vmstat -w 4 16 procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0 27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0 28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0 6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0 22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0 30 12 0 230844 576 38461648 0 0 14 2298537 216580 166661 12 45 23 20 0 29 19 0 237856 576 38422884 0 0 5 2332741 209408 157138 12 42 25 22 0 17 11 0 222156 576 38483992 0 0 4 2380495 210312 173121 12 39 28 22 0 11 13 0 216152 576 38463872 0 0 44 2362186 215236 176454 12 39 27 22 0 12 14 0 223704 576 38546720 0 0 2 2395530 214684 177518 12 39 28 22 0 39 17 0 219932 576 38598184 0 0 4 2428231 223284 179095 12 42 24 21 0 11 10 0 219956 576 38521048 0 0 4 2323853 207472 166665 12 39 28 21 0 23 15 0 216580 576 38451904 0 0 3 2241800 201049 163496 11 37 31 21 0 9 13 0 225792 576 38451916 0 0 13 2281900 204869 171814 11 36 30 23 0 14 12 0 233820 576 38492728 0 0 4 2293423 207686 173019 11 37 31 21 0 11 20 0 213796 576 38533208 0 0 3 2288636 205605 168118 11 37 31 21 0 The system begins to struggle over the next several minutes; here's what vmstat has to say: 2012-06-27 13:57:58.831-06:00 vmstat -w 4 16 procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 21 16 0 224628 576 38526872 0 0 2 1233 9 19 1 4 95 0 0 12 19 0 232060 576 38501020 0 0 4 2366769 221418 159890 12 48 20 19 0 124 8 0 218548 576 38379656 0 0 13 2103075 199660 108618 11 65 13 12 0 24 10 0 300476 576 38230288 0 0 31 1966282 177472 84572 10 76 7 7 0 20 16 0 217584 576 38296700 0 0 9 2062571 195936 128810 10 55 20 15 0 53 12 0 235720 576 38247968 0 0 30 2035407 196973 133921 10 52 23 15 0 20 16 0 360340 576 38067992 0 0 6 2192179 208692 136784 11 54 19 15 0 26 10 0 310800 576 38093884 0 0 43 2138772 207105 118718 11 64 12 13 0 24 15 0 261108 576 38030828 0 0 68 2174015 205793 135302 11 56 18 15 0 9 17 0 241816 576 37982072 0 0 20 2076145 194971 120285 10 60 16 14 0 37 15 0 255972 576 37892868 0 0 14 2225076 205694 126954 11 59 16 13 0 27 16 0 243212 576 37872704 0 0 6 2249476 210885 134684 12 60 15 14 0 30 10 0 217572 576 37795388 0 0 3 2128688 205027 118319 11 66 12 11 0 26 11 0 236420 576 37740164 0 0 23 2109709 205105 133925 10 56 19 14 0 45 15 0 330056 576 37619896 0 0 15 1948311 196188 119330 10 62 15 13 0 54 15 0 242696 576 37631500 0 0 4 2159530 202595 132588 11 59 16 15 0 2012-06-27 13:58:59.569-06:00 vmstat -w 4 16 procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 14 15 0 274932 576 37621548 0 0 2 1301 15 0 1 4 95 0 0 44 16 0 278748 576 37509516 0 0 6 2097643 196722 112697 10 66 12 11 0 88 14 0 228088 576 37412008 0 0 6 2089559 202206 116146 10 66 13 11 0 94 11 0 348348 576 37270624 0 0 7 1906390 181488 87333 9 76 9 6 0 42 11 0 215996 576 37288556 0 0 6 1782459 184375 95901 9 74 9 7 0 45 11 0 323112 576 37146500 0 0 12 1868376 187814 103947 9 71 11 9 0 51 7 0 244560 576 37100124 0 0 35 1767496 181383 107259 9 68 12 11 0 74 12 0 221584 576 37016420 0 0 18 1884986 183376 93425 9 75 9 7 0 45 10 0 275564 576 36985324 0 0 23 1683688 167223 97036 8 75 9 8 0 19 10 0 322176 576 36813176 0 0 14 1747378 177594 97218 8 72 12 8 0 122 7 0 225256 576 36838084 0 0 26 1730643 177915 92621 8 75 9 8 0 243 10 0 223464 576 36765460 0 0 18 1730158 173059 79373 8 80 6 5 0 100 10 0 307528 576 36598456 0 0 4 1738567 174077 79585 9 82 6 4 0 243 6 0 370064 576 36358576 0 0 9 1586528 174680 85353 8 81 6 5 0 267 2 0 322640 576 36254044 0 0 40 1011650 129389 42277 5 93 1 1 0 210 4 0 505092 576 35865460 0 0 25 720825 116356 32422 3 96 1 0 0 2012-06-27 14:00:03.219-06:00 vmstat -w 4 16 procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 75 1 0 566988 576 35664800 0 0 2 1355 21 3 1 4 95 0 0 433 1 0 964052 576 35069112 0 0 7 456359 102256 20901 2 98 0 0 0 547 3 0 820116 576 34893932 0 0 57 560507 114878 28115 3 96 0 0 0 806 2 0 606992 576 34848180 0 0 339 309668 101230 21056 2 98 0 0 0 708 1 0 529624 576 34708000 0 0 248 370886 101327 20062 2 97 0 0 0 231 5 0 504772 576 34663880 0 0 305 334824 95045 20407 2 97 1 1 0 158 6 0 1063088 576 33518536 0 0 531 847435 130696 47140 4 92 1 2 0 193 0 0 1449156 576 33035572 0 0 363 371279 94470 18955 2 96 1 1 0 266 6 0 1623512 576 32728164 0 0 77 241114 95730 15483 2 98 0 0 0 243 8 0 1629504 576 32653080 0 0 81 471018 100223 20920 3 96 0 1 0 70 11 0 1342140 576 33084020 0 0 100 925869 139876 56599 6 88 3 3 0 211 7 0 1130316 576 33470432 0 0 290 1008984 150699 74320 6 83 6 5 0 365 3 0 776736 576 34072772 0 0 182 747167 139436 67135 5 88 4 3 0 29 1 0 1528412 576 34110640 0 0 50 612181 137403 77609 4 87 6 3 0 266 5 0 1657688 576 34105696 0 0 3 258307 62879 38508 2 93 3 2 0 1159 2 0 2002256 576 33775476 0 0 19 88554 42112 14230 1 98 0 0 0 Right around 14:00 I was able to get a "perf -a -g"; here's the beginning of what "perf report --sort symbol --call-graph fractal,5" had to say: # 64.86% [k] _raw_spin_lock_irqsave | |--97.94%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | | | |--99.56%-- try_to_compact_pages | | __alloc_pages_direct_compact | | __alloc_pages_slowpath | | __alloc_pages_nodemask | | alloc_pages_vma | | do_huge_pmd_anonymous_page | | handle_mm_fault | | do_page_fault | | page_fault | | | | | |--53.53%-- skb_copy_datagram_iovec | | | tcp_recvmsg | | | inet_recvmsg | | | sock_recvmsg | | | sys_recvfrom | | | system_call_fastpath | | | __recv | | | | | | | --100.00%-- (nil) | | | | | |--27.80%-- __pthread_create_2_1 | | | (nil) | | | | | --18.67%-- memcpy | | | | | |--57.38%-- 0x50d000005 | | | | | |--34.52%-- 0x3b300bf271940a35 | | | | | --8.10%-- 0x1500000000000009 | --0.44%-- [...] --2.06%-- [...] 6.15% [k] isolate_freepages_block | |--99.95%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | | | |--99.54%-- try_to_compact_pages | | __alloc_pages_direct_compact | | __alloc_pages_slowpath | | __alloc_pages_nodemask | | alloc_pages_vma | | do_huge_pmd_anonymous_page | | handle_mm_fault | | do_page_fault | | page_fault | | | | | |--54.40%-- skb_copy_datagram_iovec | | | tcp_recvmsg | | | inet_recvmsg | | | sock_recvmsg | | | sys_recvfrom | | | system_call_fastpath | | | __recv | | | | | | | --100.00%-- (nil) | | | | | |--25.19%-- __pthread_create_2_1 | | | (nil) | | | | | --20.41%-- memcpy | | | | | |--40.24%-- 0x3b300bf271940a35 | | | | | |--38.29%-- 0x1500000000000009 | | | | | --21.47%-- 0x50d000005 | --0.46%-- [...] --0.05%-- [...] 3.96% [.] ceph_crc32c_le | |--99.99%-- 0xb8057558d0065990 --0.01%-- [...] A different system in the same test had a slightly different call tree, but isolate_freepages() still seems to show up prominently: # 32.32% [k] _raw_spin_lock_irqsave | |--97.64%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | try_to_compact_pages | __alloc_pages_direct_compact | __alloc_pages_slowpath | __alloc_pages_nodemask | alloc_pages_vma | do_huge_pmd_anonymous_page | handle_mm_fault | do_page_fault | page_fault | | | |--65.31%-- skb_copy_datagram_iovec | | tcp_recvmsg | | inet_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | | | --100.00%-- (nil) | | | |--30.98%-- memcpy | | | | | |--50.60%-- 0x50d0000 | | | | | --49.40%-- 0x50d000005 | --3.70%-- [...] --2.36%-- [...] 17.10% [k] _raw_spin_lock_irq | |--98.27%-- isolate_migratepages_range | compact_zone | try_to_compact_pages | __alloc_pages_direct_compact | __alloc_pages_slowpath | __alloc_pages_nodemask | alloc_pages_vma | do_huge_pmd_anonymous_page | handle_mm_fault | do_page_fault | page_fault | | | |--99.85%-- __pthread_create_2_1 | | (nil) | --0.15%-- [...] --1.73%-- [...] 4.59% [k] mutex_spin_on_owner | --- __mutex_lock_slowpath mutex_lock | |--50.14%-- page_lock_anon_vma | | | |--99.99%-- try_to_unmap_anon | | try_to_unmap | | __unmap_and_move | | unmap_and_move | | migrate_pages | | compact_zone | | try_to_compact_pages | | __alloc_pages_direct_compact | | __alloc_pages_slowpath | | __alloc_pages_nodemask | | alloc_pages_vma | | do_huge_pmd_anonymous_page | | handle_mm_fault | | do_page_fault | | page_fault | | | | | |--99.91%-- __pthread_create_2_1 | | | (nil) | | --0.09%-- [...] | --0.01%-- [...] | |--49.67%-- rmap_walk | move_to_new_page | __unmap_and_move | unmap_and_move | migrate_pages | compact_zone | try_to_compact_pages | __alloc_pages_direct_compact | __alloc_pages_slowpath | __alloc_pages_nodemask | alloc_pages_vma | do_huge_pmd_anonymous_page | handle_mm_fault | do_page_fault | page_fault | | | |--99.69%-- __pthread_create_2_1 | | (nil) | --0.31%-- [...] --0.20%-- [...] 4.10% [k] isolate_freepages_block | |--99.95%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | try_to_compact_pages | __alloc_pages_direct_compact | __alloc_pages_slowpath | __alloc_pages_nodemask | alloc_pages_vma | do_huge_pmd_anonymous_page | handle_mm_fault | do_page_fault | page_fault | | | |--46.97%-- skb_copy_datagram_iovec | | tcp_recvmsg | | inet_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | | | --100.00%-- (nil) | | | |--31.79%-- __pthread_create_2_1 | | (nil) | | | --21.24%-- memcpy | | | |--61.90%-- 0x50d000005 | | | --38.10%-- 0x50d0000 --0.05%-- [...] 3.65% [.] ceph_crc32c_le | |--99.86%-- 0xb8057558d0065990 --0.14%-- [...] I seem to be able to recreate this issue at will, so please let me know what I can do to help learn what is going on. Thanks -- Jim P.S. I got the recipients list via "scripts/get_maintainer.pl -f mm/compaction.c"; please let me know if I should have done something else. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html