On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
>On Sun 16-12-18 20:56:24, Wei Yang wrote:
>> A non-zero zone_movable_pfn indicates this node has ZONE_MOVABLE, while
>> current implementation doesn't comply with this rule when kernel
>> parameter "kernelcore=" is used.
>>
>> Current implementation doesn't harm the system, since the value in
>> zone_movable_pfn is out of the range of current zone. While user would
>> see this message during bootup, even that node doesn't has ZONE_MOVABLE.
>>
>>     Movable zone start for each node
>>       Node 0: 0x0000000080000000
>
>I am sorry but the above description confuses me more than it helps.
>Could you start over again and describe the user visible problem, then
>follow up with the underlying bug and finally continue with a proposed
>fix?

Yep, how about this one:

For example, on a machine with 8G RAM and 2 nodes with 4G each, if we
pass "kernelcore=2G" as a kernel parameter, the dmesg looks like:

     Movable zone start for each node
       Node 0: 0x0000000080000000
       Node 1: 0x0000000100000000

This looks as if both Node 0 and Node 1 have ZONE_MOVABLE, while the
following dmesg shows that only Node 1 has ZONE_MOVABLE.

     On node 0 totalpages: 524190
       DMA zone: 64 pages used for memmap
       DMA zone: 21 pages reserved
       DMA zone: 3998 pages, LIFO batch:0
       DMA32 zone: 8128 pages used for memmap
       DMA32 zone: 520192 pages, LIFO batch:63

     On node 1 totalpages: 524255
       DMA32 zone: 4096 pages used for memmap
       DMA32 zone: 262111 pages, LIFO batch:63
       Movable zone: 4096 pages used for memmap
       Movable zone: 262144 pages, LIFO batch:63

The good news is that the current result doesn't harm the ZONE_MOVABLE
calculation. However, it confuses users and may lead to code
inconsistency. For example, in adjust_zone_range_for_zone_movable(), the
comment says "Only adjust if ZONE_MOVABLE is on this node", and this is
decided by checking zone_movable_pfn. But as shown above, a non-zero
zone_movable_pfn does not always mean the node has ZONE_MOVABLE.

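To make the inconsistency concrete, here is a rough userspace sketch, not
kernel code. It assumes 4K pages (PAGE_SHIFT of 12) and that the dmesg line
prints zone_movable_pfn shifted left by PAGE_SHIFT. MAX_NUMNODES is shrunk
to 2, and node_has_zone_movable() and load_example_values() are
hypothetical names made up for this illustration.

```c
#include <stdbool.h>

#define PAGE_SHIFT   12	/* assumption: 4K pages */
#define MAX_NUMNODES 2	/* shrunk to the two nodes of the example */

/* Simplified stand-in for the kernel's per-node array. */
static unsigned long zone_movable_pfn[MAX_NUMNODES];

/*
 * Hypothetical helper modelling the "Only adjust if ZONE_MOVABLE is on
 * this node" test: any non-zero entry is treated as "has ZONE_MOVABLE".
 */
static bool node_has_zone_movable(int nid)
{
	return zone_movable_pfn[nid] != 0;
}

/* Hypothetical helper: load the values implied by the dmesg above. */
static void load_example_values(void)
{
	/* Node 0: leftover temporary value, the node has no ZONE_MOVABLE */
	zone_movable_pfn[0] = (unsigned long)(0x80000000ULL >> PAGE_SHIFT);
	/* Node 1: real start of ZONE_MOVABLE */
	zone_movable_pfn[1] = (unsigned long)(0x100000000ULL >> PAGE_SHIFT);
}
```

After load_example_values(), node_has_zone_movable(0) returns true even
though the zone listing shows no Movable zone on node 0. That is the
mismatch described above.
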
The cause of this problem is that we use zone_movable_pfn during the
iteration to record what we have already touched and to avoid double
accounting. But after this use, the temporary data is not cleared.

There are several ways to fix this issue. In this patch I propose the one
with minimal change to the current code, taking advantage of the highest
bit of zone_movable_pfn. While zone_movable_pfn holds temporary
calculation data, the highest bit is set. After the entire calculation is
complete, every zone_movable_pfn with the highest bit set is cleared.

>--
>Michal Hocko
>SUSE Labs

--
Wei Yang
Help you, Help me
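
For illustration only, the highest-bit idea described above could be
sketched in userspace like this. This is not the actual patch:
TMP_PFN_FLAG and all four helper names are hypothetical, MAX_NUMNODES is
shrunk to 2, and it assumes a real PFN never has the top bit of an
unsigned long set.

```c
#include <limits.h>

#define MAX_NUMNODES 2	/* shrunk for the example */

/* Hypothetical marker: the highest bit of an unsigned long. */
#define TMP_PFN_FLAG (1UL << (sizeof(unsigned long) * CHAR_BIT - 1))

/* Simplified stand-in for the kernel's per-node array. */
static unsigned long zone_movable_pfn[MAX_NUMNODES];

/* Record an intermediate position for a node, tagged as temporary. */
static void set_tmp_pfn(int nid, unsigned long pfn)
{
	zone_movable_pfn[nid] = pfn | TMP_PFN_FLAG;
}

/* Read the recorded position back, ignoring the temporary tag. */
static unsigned long get_pfn(int nid)
{
	return zone_movable_pfn[nid] & ~TMP_PFN_FLAG;
}

/* The node really gets ZONE_MOVABLE: store the final value untagged. */
static void set_final_pfn(int nid, unsigned long pfn)
{
	zone_movable_pfn[nid] = pfn;
}

/* After the whole calculation, drop every entry that is still tagged. */
static void clear_tmp_pfns(void)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (zone_movable_pfn[nid] & TMP_PFN_FLAG)
			zone_movable_pfn[nid] = 0;
}
```

With the example above, node 0 would be recorded via set_tmp_pfn() and
node 1 via set_final_pfn(). After clear_tmp_pfns(), only node 1 keeps a
non-zero zone_movable_pfn, so the non-zero check matches the zones that
actually exist.
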