On Thu, Mar 19, 2020 at 6:29 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Wed, 18 Mar 2020, js1304@xxxxxxxxx wrote:
>
> > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >
> > Currently, we use the zone index of preferred_zone, which represents
> > the best matching zone for allocation, as classzone_idx. It has a problem
> > on NUMA systems with ZONE_MOVABLE.
> >
>
> Hi Joonsoo,

Hello, David.

> More specifically, it has a problem on NUMA systems when the lowmem
> reserve protection exists for some zones on a node that do not exist on
> other nodes, right?

Right.

> In other words, to make sure I understand correctly, if your node 1 had a
> ZONE_MOVABLE then this would not have happened. If that's true, it might
> be helpful to call out that ZONE_MOVABLE itself is not necessarily a
> problem, but a system where one node has ZONE_NORMAL and ZONE_MOVABLE and
> another only has ZONE_NORMAL is the problem.

Okay. I will try to rewrite the commit message as you suggested.

> > In a NUMA system, it is possible that each node has different populated
> > zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zones and
> > node 1 could have only a NORMAL zone. In this setup, an allocation request
> > initiated on node 0 and one initiated on node 1 would have different
> > classzone_idx, 3 and 2, respectively, since their preferred_zones are
> > different. If they are handled by only their own node, there is no problem.
>
> I'd say "If the allocation is local" rather than "If they are handled by
> only their own node".

I will replace it with your wording. Thanks for the correction.

> > However, if they are sometimes handled by the remote node due to memory
> > shortage, the problem would happen.
> >
> > In the following setup, an allocation initiated on node 1 will take some
> > precedence over an allocation initiated on node 0 when the former is
> > processed on node 0 due to not enough memory on node 1. They will have
> > different lowmem reserves due to their different classzone_idx and thus
> > their watermark bars are also different.
> >
> > root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
> > Node 0, zone      DMA
> >   per-node stats
> > ...
> >   pages free     3965
> >         min      5
> >         low      8
> >         high     11
> >         spanned  4095
> >         present  3998
> >         managed  3977
> >         protection: (0, 2961, 4928, 5440)
> > ...
> > Node 0, zone    DMA32
> >   pages free     757955
> >         min      1129
> >         low      1887
> >         high     2645
> >         spanned  1044480
> >         present  782303
> >         managed  758116
> >         protection: (0, 0, 1967, 2479)
> > ...
> > Node 0, zone   Normal
> >   pages free     459806
> >         min      750
> >         low      1253
> >         high     1756
> >         spanned  524288
> >         present  524288
> >         managed  503620
> >         protection: (0, 0, 0, 4096)
> > ...
> > Node 0, zone  Movable
> >   pages free     130759
> >         min      195
> >         low      326
> >         high     457
> >         spanned  1966079
> >         present  131072
> >         managed  131072
> >         protection: (0, 0, 0, 0)
> > ...
> > Node 1, zone      DMA
> >   pages free     0
> >         min      0
> >         low      0
> >         high     0
> >         spanned  0
> >         present  0
> >         managed  0
> >         protection: (0, 0, 1006, 1006)
> > Node 1, zone    DMA32
> >   pages free     0
> >         min      0
> >         low      0
> >         high     0
> >         spanned  0
> >         present  0
> >         managed  0
> >         protection: (0, 0, 1006, 1006)
> > Node 1, zone   Normal
> >   per-node stats
> > ...
> >   pages free     233277
> >         min      383
> >         low      640
> >         high     897
> >         spanned  262144
> >         present  262144
> >         managed  257744
> >         protection: (0, 0, 0, 0)
> > ...
> > Node 1, zone  Movable
> >   pages free     0
> >         min      0
> >         low      0
> >         high     0
> >         spanned  262144
> >         present  0
> >         managed  0
> >         protection: (0, 0, 0, 0)
> >
> > min watermark for NORMAL zone on node 0
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 0    = 750
> >
> > This watermark difference could cause too many numa_miss allocations
> > in some situations and then performance could be degraded.
> >
> > Recently, there was a regression report about this problem on the CMA
> > patches since CMA memory is placed in ZONE_MOVABLE by those patches. I
> > checked that the problem disappears with this fix that uses high_zoneidx
> > for classzone_idx.
> >
> > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> >
> > Using high_zoneidx for classzone_idx is more consistent than the previous
> > approach because the system's memory layout doesn't affect it at all.
> > With this patch, both classzone_idx values in the above example will be 3,
> > so both will have the same min watermark.
> >
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 4096 = 4846
> >
>
> Alternatively, I assume that this could also be fixed by changing the
> value of the lowmem protection on the node without managed pages in the
> upper zone to be the max protection from the lowest zones? In your
> example, node 1 ZONE_NORMAL would then be (0, 0, 0, 4096).

No, changing node 1's protection would not help, because the watermark check
for the problematic fallback allocation is done against node 0 ZONE_NORMAL's
lowmem_reserve, indexed by the request's classzone_idx. What would work is
making the lowmem_reserve of node 0 ZONE_NORMAL (0, 0, 4096, 4096): then the
min watermark of an allocation initiated on node 1 (classzone_idx 2) would be
750 + 4096 when the allocation is tried on node 0 ZONE_NORMAL, and the issue
would be gone. So, I think that it cannot be fixed by your alternative of
changing node 1's values.

> > One could wonder whether there is a side effect: an allocation initiated
> > on node 1 would use a higher bar when the allocation is handled on node 1,
> > since its classzone_idx could be higher than before. It will not happen
> > because a zone without managed pages doesn't contribute to lowmem_reserve
> > at all.
> >
> > Reported-by: Ye Xiaolong <xiaolong.ye@xxxxxxxxx>
> > Tested-by: Ye Xiaolong <xiaolong.ye@xxxxxxxxx>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
>
> Curious: is this only an issue when vm.numa_zonelist_order is set to Node?

Do you mean "/proc/sys/vm/numa_zonelist_order"? It looks like it's gone now.

Thanks.
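For anyone following along, here is a small userspace sketch of the watermark
arithmetic being discussed; it is an illustration only, not the kernel code
(the helper name min_watermark_ok() and the standalone main() are made up for
the example). It shows how the same zone's lowmem_reserve[], indexed by the
classzone_idx of the request, yields different effective min watermarks for
requests initiated on node 0 (classzone_idx 3) and node 1 (classzone_idx 2).
The numbers are the node 0 ZONE_NORMAL values from the zoneinfo above.

/*
 * Illustration only -- not kernel code.  Watermark bar for node 0
 * ZONE_NORMAL as seen by requests with different classzone_idx.
 */
#include <stdbool.h>
#include <stdio.h>

static const unsigned long min_wmark = 750;                /* node 0 ZONE_NORMAL min */
static const unsigned long lowmem_reserve[] = { 0, 0, 0, 4096 }; /* protection: (0, 0, 0, 4096) */

static bool min_watermark_ok(unsigned long free_pages, int classzone_idx)
{
	/* the request passes only if free pages exceed min + reserve */
	return free_pages > min_wmark + lowmem_reserve[classzone_idx];
}

int main(void)
{
	unsigned long free_pages = 4000;   /* arbitrary example value */

	/* before the fix: node 0 request uses idx 3, node 1 request uses idx 2 */
	printf("node 0 initiated: bar %lu, ok %d\n",
	       min_wmark + lowmem_reserve[3], min_watermark_ok(free_pages, 3));
	printf("node 1 initiated: bar %lu, ok %d\n",
	       min_wmark + lowmem_reserve[2], min_watermark_ok(free_pages, 2));

	/* with the fix, both requests use high_zoneidx (3), so both see 4846 */
	return 0;
}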