On Tue, Dec 18, 2018 at 03:47:24PM +0100, Michal Hocko wrote:
>On Tue 18-12-18 14:39:43, Wei Yang wrote:
>> On Tue, Dec 18, 2018 at 01:14:51PM +0100, Michal Hocko wrote:
>> >On Mon 17-12-18 14:18:02, Wei Yang wrote:
>> >> On Mon, Dec 17, 2018 at 11:25:34AM +0100, Michal Hocko wrote:
>> >> >On Sun 16-12-18 20:56:24, Wei Yang wrote:
>> >> >> A non-zero zone_movable_pfn indicates this node has ZONE_MOVABLE, while
>> >> >> the current implementation doesn't comply with this rule when the kernel
>> >> >> parameter "kernelcore=" is used.
>> >> >> 
>> >> >> The current implementation doesn't harm the system, since the value in
>> >> >> zone_movable_pfn is out of the range of the current zone. But the user
>> >> >> would see this message during bootup, even though that node doesn't have
>> >> >> ZONE_MOVABLE:
>> >> >> 
>> >> >>   Movable zone start for each node
>> >> >>     Node 0: 0x0000000080000000
>> >> >
>> >> >I am sorry but the above description confuses me more than it helps.
>> >> >Could you start over again and describe the user visible problem, then
>> >> >follow up with the underlying bug and finally continue with a proposed
>> >> >fix?
>> >> 
>> >> Yep, how about this one:
>> >> 
>> >> For example, a machine with 8G RAM, 2 nodes with 4G on each, if we pass
>> >
>> >Did you mean 2G on each? Because your nodes do have 2GB each.
>> >
>> >> "kernelcore=2G" as kernel parameter, the dmesg looks like:
>> >> 
>> >>   Movable zone start for each node
>> >>     Node 0: 0x0000000080000000
>> >>     Node 1: 0x0000000100000000
>> >> 
>> >> This makes it look like both Node 0 and Node 1 have ZONE_MOVABLE, while
>> >> the following dmesg shows only Node 1 has ZONE_MOVABLE.
>> >
>> >Well, the documentation says
>> >	kernelcore=	[KNL,X86,IA-64,PPC]
>> >			Format: nn[KMGTPE] | nn% | "mirror"
>> >			This parameter specifies the amount of memory usable by
>> >			the kernel for non-movable allocations.  The requested
>> >			amount is spread evenly throughout all nodes in the
>> >			system as ZONE_NORMAL.  The remaining memory is used
>> >			for movable memory in its own zone, ZONE_MOVABLE.  In
>> >			the event, a node is too small to have both ZONE_NORMAL
>> >			and ZONE_MOVABLE, kernelcore memory will take priority
>> >			and other nodes will have a larger ZONE_MOVABLE.
>> 
>> Yes, the current behavior is a little bit different.
>
>Then it is either a bug in the implementation or in the documentation.
>
>> When you look at find_usable_zone_for_movable(), ZONE_MOVABLE is carved
>> out of the highest zone. Which means if a node doesn't have the highest
>> zone, all of its memory belongs to kernelcore.
>
>Each node can have all zones. DMA and DMA32 are address range specific,
>but there is always a NORMAL zone to hold kernel memory irrespective of
>the pfn range.
>
>> Looks like a design decision?
>> 
>> >
>> >> On node 0 totalpages: 524190
>> >>   DMA zone: 64 pages used for memmap
>> >>   DMA zone: 21 pages reserved
>> >>   DMA zone: 3998 pages, LIFO batch:0
>> >>   DMA32 zone: 8128 pages used for memmap
>> >>   DMA32 zone: 520192 pages, LIFO batch:63
>> >> 
>> >> On node 1 totalpages: 524255
>> >>   DMA32 zone: 4096 pages used for memmap
>> >>   DMA32 zone: 262111 pages, LIFO batch:63
>> >>   Movable zone: 4096 pages used for memmap
>> >>   Movable zone: 262144 pages, LIFO batch:63
>> >
>> >so assuming you really have 4GB in total and 2GB should be in kernel
>> >zones, then each node should get half of it in kernel zones and the
>> >remaining 2G evenly distributed to movable zones. So something seems
>> >broken here.
>> 
>> If we really implemented it that way, we would have the following memory
>> layout:
>> 
>> +---------+------+---------+--------+------------+
>> |DMA      |DMA32 |Movable  |DMA32   |Movable     |
>> +---------+------+---------+--------+------------+
>> |<         Node 0         >|<      Node 1       >|
>> 
>> This means we would have non-monotonically increasing zones.
>> 
>> Is this what we expect now? If it is, we really have something broken.
>
>Absolutely. Each node can have all zones as mentioned above.
>

OK, so it seems the implementation is not correct now.

BTW, would this eat the lower zones' memory? For example, would we end up
with less DMA32?

>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me
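Below is a minimal stand-alone C sketch of the behaviour being discussed, for
readers who want to poke at it. It is not the mm/page_alloc.c code: the zone
and node pfn ranges are made-up values shaped roughly like the two-node machine
above (node 0 entirely below 4GB, node 1 with 1GB above 4GB), and
usable_zone_for_movable() only mirrors what find_usable_zone_for_movable() is
described to do in the thread, namely pick the highest zone that spans any
pages as the one ZONE_MOVABLE is carved from.

/*
 * Stand-alone sketch, NOT the mm/page_alloc.c implementation.
 * All pfn ranges below are hypothetical, roughly shaped like the
 * 2-node machine discussed in the thread.
 *
 * Point being modelled: ZONE_MOVABLE is carved out of the highest
 * zone that spans any pages at all, so a node with no memory in that
 * zone ends up with no ZONE_MOVABLE and all of its memory counted as
 * kernelcore.
 */
#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

static const char * const zone_names[MAX_NR_ZONES] = {
	"DMA", "DMA32", "Normal", "Movable",
};

/* Hypothetical system-wide zone extents, in 4KiB pfns. */
static const unsigned long zone_low[MAX_NR_ZONES] = {
	[ZONE_DMA] = 0x1, [ZONE_DMA32] = 0x1000, [ZONE_NORMAL] = 0x100000,
};
static const unsigned long zone_high[MAX_NR_ZONES] = {
	[ZONE_DMA] = 0x1000, [ZONE_DMA32] = 0x100000, [ZONE_NORMAL] = 0x140000,
};

/* Hypothetical node extents: node 0 = 0..2GB, node 1 = 3GB..5GB. */
static const unsigned long node_low[2]  = { 0x1,     0xc0000  };
static const unsigned long node_high[2] = { 0x80000, 0x140000 };

/* Walk down from the top zone and pick the highest one that spans pages;
 * this mirrors what find_usable_zone_for_movable() is described to do. */
static int usable_zone_for_movable(void)
{
	for (int zone = MAX_NR_ZONES - 1; zone >= 0; zone--) {
		if (zone == ZONE_MOVABLE)
			continue;
		if (zone_high[zone] > zone_low[zone])
			return zone;
	}
	return -1;
}

int main(void)
{
	int zone = usable_zone_for_movable();

	if (zone < 0)
		return 1;

	printf("ZONE_MOVABLE is carved out of the %s zone\n", zone_names[zone]);

	for (int nid = 0; nid < 2; nid++) {
		/* pages this node has inside the chosen zone */
		unsigned long lo = node_low[nid] > zone_low[zone] ?
				   node_low[nid] : zone_low[zone];
		unsigned long hi = node_high[nid] < zone_high[zone] ?
				   node_high[nid] : zone_high[zone];
		unsigned long pages = hi > lo ? hi - lo : 0;

		printf("node %d: %lu pages usable for ZONE_MOVABLE%s\n",
		       nid, pages, pages ? "" : " -> no ZONE_MOVABLE here");
	}
	return 0;
}

With these made-up ranges the sketch reports that ZONE_MOVABLE comes from the
Normal zone, that node 0 has no pages there (so no ZONE_MOVABLE, and all of its
memory stays kernelcore), and that node 1 has 262144 pages available, which
matches the shape of the dmesg quoted above where only node 1 ends up with a
Movable zone.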