Re: [PATCH RESEND part2 v2 1/8] x86: get pg_data_t's memory from other node

Any comment on this or are the issues just going to be waved away?

On Mon, Jan 20, 2014 at 03:14:09PM +0000, Mel Gorman wrote:
> On Mon, Jan 20, 2014 at 03:29:41PM +0800, Tang Chen wrote:
> > Hi Mel,
> > 
> > On 01/17/2014 01:11 AM, Mel Gorman wrote:
> > >On Tue, Dec 03, 2013 at 10:22:00AM +0800, Zhang Yanfei wrote:
> > >>From: Yasuaki Ishimatsu<isimatu.yasuaki@xxxxxxxxxxxxxx>
> > >>
> > >>If the system can create a movable node, in which all of the node's memory
> > >>is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
> > >>that node's pg_data_t. So, invoke memblock_alloc_nid(...MAX_NUMNODES) again
> > >>to retry when the first allocation fails. Otherwise, the system could fail
> > >>to boot. (We don't use memblock_alloc_try_nid() to retry because that
> > >>function panics the system if the allocation fails.)
> > >>
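A minimal userspace sketch of the retry described in the changelog above, under loud assumptions: `node_alloc()`, `alloc_node_data()`, `struct toy_node` and `NODE_ANY` are hypothetical stand-ins (for a non-panicking node allocation and for `MAX_NUMNODES`), not kernel APIs, and the toy array only models "how much kernel-usable memory each node has left". It is meant to illustrate the fallback order, not the actual patch.

```c
#include <stddef.h>

#define NR_NODES 4
#define NODE_ANY (-1)   /* hypothetical stand-in for MAX_NUMNODES ("any node") */

/* Hypothetical model: bytes of kernel-usable (non-movable) memory per node. */
struct toy_node {
    size_t kernel_free;
};

/*
 * Take `size` bytes from node `nid`, or from the first node that has room
 * when nid == NODE_ANY. Returns the node used, or -1 on failure.
 * Unlike a "try" style allocator that panics, this one only reports failure.
 */
static int node_alloc(struct toy_node *nodes, size_t size, int nid)
{
    if (nid != NODE_ANY) {
        if (nid < 0 || nid >= NR_NODES || nodes[nid].kernel_free < size)
            return -1;
        nodes[nid].kernel_free -= size;
        return nid;
    }
    for (int i = 0; i < NR_NODES; i++) {
        if (nodes[i].kernel_free >= size) {
            nodes[i].kernel_free -= size;
            return i;
        }
    }
    return -1;
}

/* Prefer node-local memory for the node's descriptor; retry on any node. */
static int alloc_node_data(struct toy_node *nodes, size_t size, int nid)
{
    int got = node_alloc(nodes, size, nid);

    if (got < 0)
        got = node_alloc(nodes, size, NODE_ANY);
    return got;
}
```

In this model a node whose memory is entirely movable has `kernel_free == 0`, so its descriptor ends up on another node, which is exactly where the remote access cost raised later in the thread comes from.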
> > >
> > >This implies that it is possible to have a configuration with a big ratio
> > >difference between Normal:Movable memory. In such configurations there
> > >would be a risk that the system will reclaim heavily or go OOM because
> > >the kernel cannot allocate memory due to a relatively small Normal
> > >zone. What protects against that? Is the user ever warned if the ratio
> > >between Normal:Movable is very high?
> > 
> > For now, there is no way to protect against this. But on a modern
> > server, it won't be that easy to run out of memory when booting, I think.
> > 
> 
> 
> Booting is a basic functional requirement and I'm more concerned about the
> behaviour of the kernel when the machine is running.  If the kernel thrashes
> heavily or goes OOM when a workload starts then the fact the machine booted
> is not much comfort.
> 
> > The current implementation will set any node the kernel resides in
> > as unhotpluggable, which means normal zone here. And on today's
> > servers, especially memory hotplug servers, each node would have at
> > least 16GB of memory, which is enough for the kernel to boot.
> > 
> 
> Again, booting is fine but let's say it's an 8-node machine then that
> implies the Normal:Movable ratio will be 1:8. All page table pages, inodes,
> dentries etc will have to fit in that 1/8th of memory with all the associated
> costs including remote access penalties.  In extreme cases it may not be
> possible to use all of memory because the management structures cannot be
> allocated. Users may want the option of adjusting what this ratio is so
> they can unplug some memory while not completely sacrificing performance.
> 
> Minimally, the kernel should print a big fat warning if the ratio is equal
> to or more than 1:3 Normal:Movable. That ratio selection is arbitrary. I do
> not recall ever seeing any major Normal:Highmem bugs on 4G 32-bit machines so
> it is a conservative choice. The last Normal:Highmem bug I remember was related
> to a 16G 32-bit machine (https://bugzilla.kernel.org/show_bug.cgi?id=42578),
> so a 1:15 ratio feels very optimistic for a very large machine.
> 
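The warning condition suggested above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `movable_ratio_too_high()`, `warn_if_ratio_high()`, `MOVABLE_WARN_MULTIPLIER` and the message text are hypothetical names invented here, the 1:3 threshold is the arbitrary one from the paragraph above, and real kernel code would take the page counts from its own zone accounting rather than from function arguments.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical threshold: warn once Movable is at least 3x Normal (1:3). */
#define MOVABLE_WARN_MULTIPLIER 3ULL

/* True when Normal:Movable is 1:3 or more lopsided. */
static bool movable_ratio_too_high(unsigned long long normal_pages,
                                   unsigned long long movable_pages)
{
    /* No Normal memory at all: any Movable memory is too much. */
    if (normal_pages == 0)
        return movable_pages > 0;

    /* Divide instead of multiplying so large page counts cannot overflow. */
    return movable_pages / normal_pages >= MOVABLE_WARN_MULTIPLIER;
}

/* Print a warning to stderr when the ratio crosses the threshold. */
static void warn_if_ratio_high(unsigned long long normal_pages,
                               unsigned long long movable_pages)
{
    if (!movable_ratio_too_high(normal_pages, movable_pages))
        return;

    fprintf(stderr,
            "WARNING: %llu Normal pages vs %llu Movable pages; "
            "kernel allocations may thrash or hit OOM\n",
            normal_pages, movable_pages);
}
```

For the 8-node case discussed earlier, with one node of Normal memory and seven equally sized nodes of Movable memory, `movable_ratio_too_high(1, 7)` is true, so the warning would fire.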
> > We can add a patch to make it return to the original path if we run
> > out of memory, which means turning off the functionality and warning
> > users in the log.
> > 
> > What do you think?
> > 
> 
> I think that will allow the machine to boot but that there still will be a
> large number of bugs filed with these machines due to high Normal:Movable
> ratios. The shape of the bug reports will be similar to the Normal:Highmem
> ratio bugs that existed years ago.
> 
> > >The movable_node boot parameter still turns the feature on and off, but
> > >there appears to be no way of controlling the ratio of memory other than
> > >booting with the minimum amount of memory and manually hot-adding the
> > >sections to set the appropriate ratio.
> > 
> > For now, yes. We expect firmware and hardware to give the basic
> > ratio (how much memory is hotpluggable), and the user decides how to
> > arrange the memory (decide the size of normal zone and movable zone).
> > 
> 
> There seem to be big gaps in the configuration options here. The user
> can either ask it to be automatically assigned and have no control over
> the ratio or manually hot-add the memory which is a relatively heavy
> administrative burden.
> 
> I think they should be warned if the ratio is high and have an option of
> specifying a ratio manually even if that means that additional nodes
> will not be hot-removable.
> 
> This is all still a kludge around the fact that node memory hot-remove
> did not try and cope with full migration by breaking some of the 1:1
> virt:phys mapping assumptions when hot-remove was enabled.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

-- 
Mel Gorman
SUSE Labs
