Re: [PATCH v4] KSM: numa awareness sysfs knob

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Hugh,

first of all, please let me apologize for the delay in my response and thank
you for your extensive review!

On Sun, 30 Sep 2012, Hugh Dickins wrote:
> Andrea's point about ksm_migrate_page() is an important one, and I've
> answered that against his mail, but here's some other easier points.
> 
> On Mon, 24 Sep 2012, Petr Holasek wrote:
> 
> > Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
> > which control merging pages across different numa nodes.
> > When it is set to zero only pages from the same node are merged,
> > otherwise pages from all nodes can be merged together (default behavior).
> > 

...

> > 
> > v4:	- merge_nodes was renamed to merge_across_nodes
> > 	- share_all debug knob was dropped
> 
> Yes, you were right to drop the share_all knob for now.  I do like the
> idea, but it was quite inappropriate to mix it in with this NUMAnode
> patch.  And although I like the idea, I think it wants a bit more: I
> already have a hacky "allksm" boot option patch to mm/mmap.c for my
> own testing, which adds VM_MERGEABLE everywhere.  If I gave that up
> (I'd like to!) in favour of yours, I think I would still be missing
> all the mms that are created before there's a chance to switch the
> share_all mode on.  Or maybe I misread your v3.  Anyway, that's a
> different topic, happily taken off today's agenda.
> 

Agreed, it hid original purpose of this patch and made it more difficult for
eventual merging. So let's move it lower on the ksm todo list for this time :)

> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 47c8853..7c82032 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -36,6 +36,7 @@
> >  #include <linux/hash.h>
> >  #include <linux/freezer.h>
> >  #include <linux/oom.h>
> > +#include <linux/numa.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> > @@ -140,7 +141,10 @@ struct rmap_item {
> >  	unsigned long address;		/* + low bits used for flags below */
> >  	unsigned int oldchecksum;	/* when unstable */
> >  	union {
> > -		struct rb_node node;	/* when node of unstable tree */
> > +		struct {
> > +			struct rb_node node;	/* when node of unstable tree */
> > +			struct rb_root *root;
> > +		};
> 
> This worries me a little, enlarging struct rmap_item: there may
> be very many of them in the system, best to minimize their size.
> 
> (This struct rb_root *root is used for one thing only, isn't it?  For the
> unstable rb_erase() in remove_rmap_item_from_tree().  It annoys me that
> we need to record root just for that, but I don't see an alternative.)

Yes, I've played a quite lot with this issue, but wasn't able to find an
alternative solution, too.

> 
> The 64-bit case can be easily improved by locating unsigned int nid
> after oldchecksum instead.  The 32-bit case (which supports smaller
> NODES_SHIFT: 6 was the largest I found) could be "improved" by keeping
> nid in the lower bits of address along with (moved) UNSTABLE_FLAG and
> STABLE_FLAG and reduced SEQNR_MASK - we really need only 1 bit for that,
> but 2 bits would be nice for keeping the BUG_ON(age > 1).
> 
> But you may think I'm going too far there, and prefer just to put
> #ifdef CONFIG_NUMA around the unsigned int nid, so at least it does
> not enlarge the more basic 32-bit configuration.
> 

I like your idea of unsigned int nid, will implement it in next version.

...

> >  
> > -	for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
> > -		struct stable_node *stable_node;
> > +	for (i = 0; i < MAX_NUMNODES; i++)
> 
> It's irritating to have to do this outer nid loop, but I think you're
> right, that the memory hotremove notification does not quite tell us
> which node to look at.  Or can we derive that from start_pfn?  Would
> it be safe to assume that end_pfn-1 must be in the same node?
> 

I had assumed that we can't rely on end_pfn-1 in the same node, but now
mm/memory_hotremove.c looks to me that we can rely on it because memory hotremove
callback is triggered for each zone where we are removing memory.
So I think yes, we can optimize it in the way you mentioned above. If I am wrong
correct me, please :)

...

> 
> Looks nice - thank you.
> 

Thanks for your help!

Petr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>


[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]