[PATCH/RFC 4/14] Shared Policy: let vma policy ops handle sub-vma policies

Shared Policy Infrastructure  - let vma policy ops handle sub-vma policies

Shared policies can handle subranges of an object, so there is no need
to split the vma for mappings backed by them.  Therefore:

Add a vm_mpol_flag member to vm_area_struct with flags to control the
vma splitting behavior.

	Note:  Perhaps there is another field where these flags
	could go?

Modify mbind_range() and policy_vma() to call the set_policy vma op, if
one exists, for vmas with VMPOL_F_NOSPLIT set, instead of splitting the
vma for the mempolicy range.  However, if a vma is VM_SHARED and we would
otherwise be splitting it, don't:  the policy would just be ignored, and
numa_maps would be [currently are] misleading.

Now, we don't want private mappings modifying the shared policy of the
mapped file, if any, so private mappings continue to use vma policy.
We'll still split vmas for private mappings.

As a result, this patch enforces a defined semantic for set|get_policy()
ops:  they only get called for linear, shared mappings, and in that
case we don't split the vma.  Only shmem currently has set|get_policy()
ops, and this seems an appropriate semantic for shared objects, in
general.  It also matches current behavior.
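
The resulting split/no-split decision can be sketched in self-contained
userspace C.  The types and flag values below are simplified stand-ins
for the kernel structures, not the real definitions; the logic mirrors
the new mpol_nosplit_vma()/vma_is_shared_linear() helpers in the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the real ones live in the kernel headers. */
#define VM_SHARED	0x0008UL
#define VM_NONLINEAR	0x0800UL
#define VMPOL_F_NOSPLIT	0x0001

/* Toy stand-in for struct vm_area_struct with only the fields used. */
struct toy_vma {
	unsigned long vm_flags;
	int vm_mpol_flags;
	bool has_set_policy;	/* stands in for vma->vm_ops->set_policy */
};

/* Mirrors vma_is_shared_linear(): shared and not non-linear. */
static bool toy_is_shared_linear(struct toy_vma *vma)
{
	return (vma->vm_flags & (VM_SHARED | VM_NONLINEAR)) == VM_SHARED;
}

/*
 * Mirrors mpol_nosplit_vma(): once a linear, shared mapping with a
 * set_policy op is seen, cache the result in VMPOL_F_NOSPLIT.
 */
static bool toy_nosplit(struct toy_vma *vma)
{
	if (vma->vm_mpol_flags & VMPOL_F_NOSPLIT)
		return true;
	if (toy_is_shared_linear(vma) && vma->has_set_policy) {
		vma->vm_mpol_flags |= VMPOL_F_NOSPLIT;
		return true;
	}
	return false;
}
```

Private mappings and non-linear shared mappings fall through to false,
so mbind_range() keeps splitting (or skipping) them as described above.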

Now, since the vma start and end addresses no longer specify the
range to which a new policy applies, we need to add start,end address
args to the vma policy ops.  The set_policy op/handler just calls into
mpol_set_shared_policy() to do the real work, so we could just pass
the start and end address, along with the vma, down to that function.
However, to eliminate the need for the pseudo-vma on the stack when
initializing the shared policy for an inode with non-default "superblock
policy", we change the interface to mpol_set_shared_policy() to take a
page offset and size in pages.  We compute the page offset and size in
the shmem set_policy handler from the vma and the address range.
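
For illustration, the (vma, start, end) to (pgoff, size-in-pages)
conversion that the shmem set_policy handler now performs can be
sketched in userspace C (toy struct, illustrative PAGE_SHIFT):

```c
#include <assert.h>

#define PAGE_SHIFT 12		/* illustrative 4KB base pages */
typedef unsigned long pgoff_t;

/* Toy stand-in for struct vm_area_struct with only the fields used. */
struct toy_vma {
	unsigned long vm_start;	/* first virtual address of the mapping */
	pgoff_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Same arithmetic as the new vma_mpol_pgoff() helper. */
static pgoff_t toy_mpol_pgoff(struct toy_vma *vma, unsigned long addr)
{
	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/* The pair shmem_set_policy() now passes to mpol_set_shared_policy(). */
static void toy_range_to_file_pages(struct toy_vma *vma,
				    unsigned long start, unsigned long end,
				    pgoff_t *pgoff, unsigned long *sz)
{
	*pgoff = toy_mpol_pgoff(vma, start);
	*sz = (end - start) >> PAGE_SHIFT;
}
```

E.g. a vma mapped at 0x100000 with vm_pgoff 4, given the range
[0x102000, 0x105000), yields pgoff 6 and a size of 3 pages.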

NOTE:  Added helper function "vma_mpol_pgoff()" for computing page
offset for interleaving.  This is similar to [linear_]page_index()
but does not offset by the PAGE_CACHE_SHIFT so that it can be used
for calculating page indices for interleaving for both base pages
and huge pages [subsequent patch].  Perhaps this can be merged with
other similar functions?
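
Because the helper shifts only by PAGE_SHIFT, a huge-page caller can
rescale its result to huge-page units.  The sketch below is an
assumption about how the follow-on huge-page patch might use it (the
toy_hpage_interleave_idx() name and HPAGE_SHIFT value are illustrative,
not code from this series):

```c
#include <assert.h>

#define PAGE_SHIFT	12	/* illustrative 4KB base pages */
#define HPAGE_SHIFT	21	/* illustrative 2MB huge pages */
typedef unsigned long pgoff_t;

/* Toy stand-in for struct vm_area_struct with only the fields used. */
struct toy_vma {
	unsigned long vm_start;
	pgoff_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Same arithmetic as vma_mpol_pgoff(): base-page index into the file. */
static pgoff_t toy_mpol_pgoff(struct toy_vma *vma, unsigned long addr)
{
	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Hypothetical huge-page interleave index: rescale the base-page file
 * index to huge-page units by dividing out the pages-per-huge-page.
 */
static pgoff_t toy_hpage_interleave_idx(struct toy_vma *vma,
					unsigned long addr)
{
	return toy_mpol_pgoff(vma, addr) >> (HPAGE_SHIFT - PAGE_SHIFT);
}
```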

A word about non-linear mappings:

Shared policy ops have always installed and looked up shared policies
at a given (vma, address) by computing a page offset and size into the
backing file from the (vma, address), assuming a linear mapping from
virtual addresses to file page offsets.  Therefore, this patch series
restricts shared policies to linearly mapped shared file mappings.  This
is nominally a change in behavior.

I don't know whether anyone is attempting to use memory policies with
non-linearly mapped shared memory areas or hugetlbfs mappings.  If so,
I don't understand what behavior they expect.  Since different tasks
could establish different mappings to the shared files, pages and mempolicies
that show up at one offset in one task can show up at a different offset
in another task.  However, if this is a required feature and we can come up
with reasonable semantics for supporting shared policies with nonlinear
mappings, I can try to support it.

Note:  although this patch removes the splitting of the vma when a shared
policy is installed on a sub-range of the shared area, I will defer updating
the documentation to the following patch, which fixes the numa_maps display.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@xxxxxx>

 fs/sysfs/bin.c                |    5 +
 include/linux/mempolicy.h     |   16 +++++
 include/linux/mm.h            |   13 ++++
 include/linux/mm_types.h      |    1 
 include/linux/shared_policy.h |    7 +-
 ipc/shm.c                     |    5 +
 mm/mempolicy.c                |  119 ++++++++++++++++++++++++++++++++----------
 mm/shmem.c                    |   16 +++--
 8 files changed, 140 insertions(+), 42 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm_types.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm_types.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm_types.h
@@ -182,6 +182,7 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+	int vm_mpol_flags;		/* NOSPLIT, ... */
 #endif
 };
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -110,6 +110,13 @@ struct mempolicy {
 };
 
 /*
+ * vma memory policy flags
+ */
+enum vm_mpol_flags {
+	VMPOL_F_NOSPLIT  = 0x00000001,	/* don't split vma for mempolicy */
+};
+
+/*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
  */
@@ -157,6 +164,11 @@ static inline struct mempolicy *mpol_dup
 #define vma_policy(vma) ((vma)->vm_policy)
 #define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
 
+static inline void mpol_set_vma_nosplit(struct vm_area_struct *vma)
+{
+	(vma)->vm_mpol_flags |= VMPOL_F_NOSPLIT;
+}
+
 static inline void mpol_get(struct mempolicy *pol)
 {
 	if (pol)
@@ -258,6 +270,10 @@ static inline struct mempolicy *mpol_dup
 #define vma_policy(vma) NULL
 #define vma_set_policy(vma, pol) do {} while(0)
 
+static inline void mpol_set_vma_nosplit(struct vm_area_struct *vma)
+{
+}
+
 static inline void numa_policy_init(void)
 {
 }
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
@@ -212,7 +212,8 @@ struct vm_operations_struct {
 	 * install a MPOL_DEFAULT policy, nor the task or system default
 	 * mempolicy.
 	 */
-	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+	int (*set_policy)(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end, struct mempolicy *new);
 
 	/*
 	 * get_policy() op must add reference [mpol_get()] to any policy at
@@ -1232,6 +1233,16 @@ extern int after_bootmem;
 
 extern void setup_per_cpu_pageset(void);
 
+/*
+ * Address to offset for policy lookup and interleave calculation.
+ * Placed here because it needs struct vma definition.
+ */
+static inline pgoff_t vma_mpol_pgoff(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+}
+
 extern void zone_pcp_update(struct zone *zone);
 
 /* nommu.c */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -32,9 +32,8 @@ struct shared_policy {
 extern struct shared_policy *mpol_shared_policy_new(
 					struct address_space *mapping,
 					struct mempolicy *mpol);
-extern int mpol_set_shared_policy(struct shared_policy *,
-					struct vm_area_struct *,
-					struct mempolicy *);
+extern int mpol_set_shared_policy(struct shared_policy *, pgoff_t,
+				unsigned long, struct mempolicy *);
 extern void mpol_free_shared_policy(struct address_space *);
 extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
 					unsigned long);
@@ -44,7 +43,7 @@ extern struct mempolicy *mpol_shared_pol
 struct shared_policy {};
 
 static inline int mpol_set_shared_policy(struct shared_policy *info,
-					struct vm_area_struct *vma,
+					pgoff_t pgoff, unsigned long sz,
 					struct mempolicy *new)
 {
 	return -EINVAL;
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1158,7 +1158,7 @@ struct page *shmem_swapin(swp_entry_t en
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
 	pvma.vm_pgoff = idx;
-	pvma.vm_ops = NULL;
+	pvma.vm_file = NULL;
 	pvma.vm_policy = spol;
 	page = swapin_readahead(entry, gfp, &pvma, 0);
 	return page;
@@ -1172,7 +1172,7 @@ static struct page *shmem_alloc_page(gfp
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
 	pvma.vm_pgoff = idx;
-	pvma.vm_ops = NULL;
+	pvma.vm_file = NULL;
 	pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
 
 	/*
@@ -1525,7 +1525,8 @@ static int shmem_fault(struct vm_area_st
 }
 
 #ifdef CONFIG_NUMA
-static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+static int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct shared_policy *sp = mapping_shared_policy(mapping);
@@ -1535,7 +1536,8 @@ static int shmem_set_policy(struct vm_ar
 		if (IS_ERR(sp))
 			return PTR_ERR(sp);
 	}
-	return mpol_set_shared_policy(sp, vma, new);
+	return mpol_set_shared_policy(sp, vma_mpol_pgoff(vma, start),
+					(end - start) >> PAGE_SHIFT, new);
 }
 
 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
@@ -1543,12 +1545,10 @@ static struct mempolicy *shmem_get_polic
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct shared_policy *sp = mapping_shared_policy(mapping);
-	unsigned long idx;
 
 	if (!sp)
 		return NULL;	/* == default policy */
-	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(sp, idx);
+	return mpol_shared_policy_lookup(sp, vma_mpol_pgoff(vma, addr));
 }
 #endif
 
@@ -1583,6 +1583,7 @@ static int shmem_mmap(struct file *file,
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	vma->vm_flags |= VM_CAN_NONLINEAR;
+	mpol_set_vma_nosplit(vma);
 	return 0;
 }
 
@@ -2802,5 +2803,6 @@ int shmem_zero_setup(struct vm_area_stru
 		fput(vma->vm_file);
 	vma->vm_file = file;
 	vma->vm_ops = &shmem_vm_ops;
+	mpol_set_vma_nosplit(vma);
 	return 0;
 }
Index: linux-2.6.36-mmotm-101103-1217/ipc/shm.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/ipc/shm.c
+++ linux-2.6.36-mmotm-101103-1217/ipc/shm.c
@@ -223,13 +223,14 @@ static int shm_fault(struct vm_area_stru
 }
 
 #ifdef CONFIG_NUMA
-static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+int shm_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	struct file *file = vma->vm_file;
 	struct shm_file_data *sfd = shm_file_data(file);
 	int err = 0;
 	if (sfd->vm_ops->set_policy)
-		err = sfd->vm_ops->set_policy(vma, new);
+		err = sfd->vm_ops->set_policy(vma, start, end, new);
 	return err;
 }
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -610,20 +610,60 @@ check_range(struct mm_struct *mm, unsign
 	return first;
 }
 
-/* Apply policy to a single VMA */
-static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+static bool vma_is_shared_linear(struct vm_area_struct *vma)
+{
+	return ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED);
+}
+
+static bool mpol_nosplit_vma(struct vm_area_struct *vma)
+{
+	if (vma->vm_mpol_flags & VMPOL_F_NOSPLIT)
+		return true;
+
+	if (vma_is_shared_linear(vma)  &&
+	    vma->vm_ops && vma->vm_ops->set_policy) {
+		vma->vm_mpol_flags |= VMPOL_F_NOSPLIT;
+		return true;
+	}
+	return false;
+}
+
+static bool mpol_use_get_op(struct vm_area_struct *vma)
+{
+	/*
+	 * Not for anon/private mappings.
+	 * And no need to invoke get_policy op if file doesn't
+	 * already have a shared policy.
+	 */
+	if (!vma_is_shared_linear(vma) ||
+	    !vma->vm_file || !vma->vm_file->f_mapping->spolicy)
+		return false;
+
+	VM_BUG_ON(!(vma->vm_ops && vma->vm_ops->get_policy));
+	return true;
+}
+
+/*
+ * Apply policy to a single VMA, or a subrange thereof
+ */
+static int policy_vma(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new,
+			bool use_set_op)
 {
 	int err = 0;
-	struct mempolicy *old = vma->vm_policy;
 
 	pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
-		 vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		 start, end, vma_mpol_pgoff(vma, start),
 		 vma->vm_ops, vma->vm_file,
 		 vma->vm_ops ? vma->vm_ops->set_policy : NULL);
 
-	if (vma->vm_ops && vma->vm_ops->set_policy)
-		err = vma->vm_ops->set_policy(vma, new);
-	if (!err) {
+	/*
+	 * set_policy op, if exists, is responsible for policy ref counts.
+	 */
+	if (use_set_op)
+		err = vma->vm_ops->set_policy(vma, start, end, new);
+	else {
+		struct mempolicy *old = vma->vm_policy;
 		mpol_get(new);
 		vma->vm_policy = new;
 		mpol_put(old);
@@ -652,6 +692,28 @@ static int mbind_range(struct mm_struct
 		vmstart = max(start, vma->vm_start);
 		vmend   = min(end, vma->vm_end);
 
+		if (mpol_nosplit_vma(vma)) {
+			/*
+			 * set_policy op handles policies on sub-range
+			 * of vma for linear, shared mappings
+			 */
+			err = policy_vma(vma, vmstart, vmend, new_pol, true);
+			if (err)
+				break;
+			continue;
+		} else if (vma->vm_flags & VM_SHARED) {
+			/*
+			 * mempolicy will be ignored, so don't bother to
+			 * modify vma.  numa_maps would be misleading.
+			 */
+			continue;
+		}
+
+		/*
+		 * for private mappings and shared mappings of objects whose
+		 * mempolicy vm_ops don't support sub-range policies,
+		 * merge/split the vma, as needed, and use vma policy
+		 */
 		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
 				  vma->anon_vma, vma->vm_file, pgoff, new_pol);
@@ -670,7 +732,7 @@ static int mbind_range(struct mm_struct
 			if (err)
 				goto out;
 		}
-		err = policy_vma(vma, new_pol);
+		err = policy_vma(vma, vmstart, vmend, new_pol, false);
 		if (err)
 			goto out;
 	}
@@ -1493,7 +1555,10 @@ static struct mempolicy *get_vma_policy(
 	struct mempolicy *pol = task->mempolicy;
 
 	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy) {
+		/*
+		 * use get_policy op, if any, for shared mappings
+		 */
+		if (mpol_use_get_op(vma)) {
 			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
 									addr);
 			if (vpol)
@@ -2193,14 +2258,8 @@ put_free:
 	spin_lock_init(&sp->lock);
 
 	if (new) {
-		/*
-		 * Create pseudo-vma to specify policy range and
-		 * install new mempolicy
-		 */
-		struct vm_area_struct pvma;
-		memset(&pvma, 0, sizeof(struct vm_area_struct));
-		pvma.vm_end = TASK_SIZE;	/* policy covers entire file */
-		err = mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
+		err = mpol_set_shared_policy(sp, 0UL, TASK_SIZE >> PAGE_SHIFT,
+						new);
 		mpol_put(new);			/* drop initial ref */
 	}
 
@@ -2220,25 +2279,33 @@ put_free:
 	return spx;
 }
 
+/**
+ * mpol_set_shared_policy - install mempolicy in shared policy tree
+ * @sp:	 pointer to shared policy struct
+ * @pgoff:  offset in address_space where mempolicy applies
+ * @sz:  size of range [pages] to which mempolicy applies
+ * @mpol:  the mempolicy to install
+ *
+ */
 int mpol_set_shared_policy(struct shared_policy *sp,
-			struct vm_area_struct *vma, struct mempolicy *npol)
+				pgoff_t pgoff, unsigned long sz,
+				struct mempolicy *mpol)
 {
 	int err;
 	struct sp_node *new = NULL;
-	unsigned long sz = vma_pages(vma);
 
 	pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n",
-		 vma->vm_pgoff,
-		 sz, npol ? npol->mode : -1,
-		 npol ? npol->flags : -1,
-		 npol ? nodes_addr(npol->v.nodes)[0] : -1);
+		 pgoff,
+		 sz, mpol ? mpol->mode : -1,
+		 mpol ? mpol->flags : -1,
+		 mpol ? nodes_addr(mpol->v.nodes)[0] : -1);
 
-	if (npol) {
-		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+	if (mpol) {
+		new = sp_alloc(pgoff, pgoff + sz, mpol);
 		if (!new)
 			return -ENOMEM;
 	}
-	err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+	err = shared_policy_replace(sp, pgoff, pgoff+sz, new);
 	if (err && new)
 		kmem_cache_free(sn_cache, new);
 	return err;
Index: linux-2.6.36-mmotm-101103-1217/fs/sysfs/bin.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/sysfs/bin.c
+++ linux-2.6.36-mmotm-101103-1217/fs/sysfs/bin.c
@@ -256,7 +256,8 @@ static int bin_access(struct vm_area_str
 }
 
 #ifdef CONFIG_NUMA
-static int bin_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+static int bin_set_policy(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end, struct mempolicy *new)
 {
 	struct file *file = vma->vm_file;
 	struct bin_buffer *bb = file->private_data;
@@ -271,7 +272,7 @@ static int bin_set_policy(struct vm_area
 
 	ret = 0;
 	if (bb->vm_ops->set_policy)
-		ret = bb->vm_ops->set_policy(vma, new);
+		ret = bb->vm_ops->set_policy(vma, start, end, new);
 
 	sysfs_put_active(attr_sd);
 	return ret;