On Fri 24-02-17 14:49:52, Jia He wrote:
> In a NUMA server, the topology looks like
> available: 3 nodes (0,2-3)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 15299 MB
> node 2 free: 289 MB
> node 3 cpus:
> node 3 size: 15336 MB
> node 3 free: 184 MB
> node distances:
> node   0   2   3
>   0:  10  40  40
>   2:  40  10  20
>   3:  40  20  10
>
> When I try to dynamically allocate more hugepages than the system's total
> free memory:
> e.g. echo 4000 >/proc/sys/vm/nr_hugepages
>
> Then kswapd will take 100% cpu for a long time (more than 3 hours, with no
> sign of finishing)
> top result:
> top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
> Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
> KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
>
>   PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root  20   0     0    0    0 R 100.0 0.000 217:17.29 kswapd3
>
> The root cause is: kswapd3 is woken up and then tries to reclaim again and
> again but it makes no progress. In the end fewer than 4000 hugepages are
> allocated.
> HugePages_Total: 1864
> HugePages_Free: 1864
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 16384 kB
>
> At that time, even though there are no reclaimable pages in node 3, kswapd3
> will not go to sleep.
> Node 3, zone DMA
>   per-node stats
>       nr_inactive_anon 0
>       nr_active_anon 0
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_unevictable 0
>       nr_isolated_anon 0
>       nr_isolated_file 0
>       nr_pages_scanned 0
>       workingset_refault 0
>       workingset_activate 0
>       workingset_nodereclaim 0
>       nr_anon_pages 0
>       nr_mapped 0
>       nr_file_pages 0
>       nr_dirty 0
>       nr_writeback 0
>       nr_writeback_temp 0
>       nr_shmem 0
>       nr_shmem_hugepages 0
>       nr_shmem_pmdmapped 0
>       nr_anon_transparent_hugepages 0
>       nr_unstable 0
>       nr_vmscan_write 0
>       nr_vmscan_immediate_reclaim 0
>       nr_dirtied 0
>       nr_written 0
>   pages free 2951
>         min 2821
>         low 3526
>         high 4231
>         node_scanned 0
>         spanned 245760
>         present 245760
>         managed 245388
>       nr_free_pages 2951
>       nr_zone_inactive_anon 0
>       nr_zone_active_anon 0
>       nr_zone_inactive_file 0
>       nr_zone_active_file 0
>       nr_zone_unevictable 0
>       nr_zone_write_pending 0
>       nr_mlock 0
>       nr_slab_reclaimable 46
>       nr_slab_unreclaimable 90
>       nr_page_table_pages 0
>       nr_kernel_stack 0
>       nr_bounce 0
>       nr_zspages 0
>       numa_hit 2257
>       numa_miss 0
>       numa_foreign 0
>       numa_interleave 982
>       numa_local 0
>       numa_other 2257
>       nr_free_cma 0
>         protection: (0, 0, 0, 0)
>
> It could be called a misconfiguration, but it seems that it might be quite
> easy to hit on NUMA machines which have large differences in node sizes.
>
> Furthermore, when most of the memory in node 3 is consumed, every allocation
> slow path might wake up kswapd3, which makes things worse:
> __alloc_pages_slowpath
>     wake_all_kswapds
>         wakeup_kswapd
>
> This patch resolves the issue from 2 aspects:
> 1. In prepare_kswapd_sleep, kswapd keeps reclaiming without sleeping only
>    when a zone is not balanced and there are reclaimable pages in that zone
> 2. Don't wake up kswapd if there are no reclaimable pages in that node
>
> After this patch:
> top - 07:29:43 up 3 min, 1 user, load average: 0.12, 0.13, 0.06
> Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 97.8 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem: 31371520 total, 938112 used, 30433408 free, 5504 buffers
> KiB Swap: 6284224 total, 0 used, 6284224 free. 632448 cached Mem
>
>   PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM     TIME+ COMMAND
>    78 root  20   0     0    0    0 S 0.000 0.000   0:00.00 kswapd3
>
> Changes:
> V2: - fix incorrect condition for assignment of node_has_reclaimable_pages
>     - make commit description better

I believe we should pursue the proposal from Johannes, which is more generic
and copes with corner cases much better.

-- 
Michal Hocko
SUSE Labs
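
For anyone following the thread, the two conditions in the quoted changelog
can be modelled roughly as below. This is only a userspace sketch, not code
from the patch or from mm/vmscan.c: struct zone_model and all three function
names are made up for illustration, the balance check is reduced to "free
pages have reached the high watermark", and "reclaimable pages" is a single
counter standing in for the LRU counts shown in the zoneinfo dump.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a zone; NOT the kernel's struct zone. */
struct zone_model {
	long free_pages;        /* "pages free" in /proc/zoneinfo */
	long high_wmark;        /* "high" watermark in /proc/zoneinfo */
	long reclaimable_pages; /* assumed sum of LRU pages reclaim could free */
};

/* Simplified balance test: free pages have reached the high watermark. */
static bool zone_model_balanced(const struct zone_model *z)
{
	assert(z->free_pages >= 0 && z->high_wmark >= 0);
	return z->free_pages >= z->high_wmark;
}

/*
 * Aspect 1 of the changelog: kswapd skips sleeping only if some zone is
 * both unbalanced and still has something to reclaim.
 */
static bool kswapd_should_keep_reclaiming(const struct zone_model *zones, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (!zone_model_balanced(&zones[i]) &&
		    zones[i].reclaimable_pages > 0)
			return true;
	}
	return false;
}

/*
 * Aspect 2 of the changelog: waking kswapd is pointless when no zone of
 * the node has reclaimable pages.
 */
static bool node_worth_waking(const struct zone_model *zones, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (zones[i].reclaimable_pages > 0)
			return true;
	}
	return false;
}
```

With the node 3 numbers from the report (free 2951, high 4231, every LRU
counter 0), the zone is below its watermark but has nothing to reclaim, so
both helpers return false: in this model kswapd3 would go to sleep and would
not be woken again.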