On Fri, Jan 18, 2008 at 11:46:22PM +0100, Nick Piggin wrote:
> On Fri, Jan 18, 2008 at 08:41:22AM -0800, Linus Torvalds wrote:
> >
> > Then, just have a library version of the long form, and make architectures
> > that don't support it just use that (just so that you don't have to
> > duplicate that silly thing). So an architecture that supports special page
> > flags would do something like
> >
> > 	#define set_special_page(vma,addr,ptep,pfn,prot) \
> > 		set_pte_at(vma, addr, ptep, mk_special_pte(pfn,prot))
> > 	#define vm_normal_page(vma,addr,pte) \
> > 		(pte_special(pte) ? NULL : pte_page(pte))
> >
> > and other architectures would just do
> >
> > 	#define set_special_page(vma,addr,ptep,pfn,prot) \
> > 		set_pte_at(vma, addr, ptep, mk_pte(pfn,prot))
> > 	#define vm_normal_page(vma,addr,pte) \
> > 		generic_vm_normal_page(vma,addr,pte)
> >
> > or something.
> >
> > THAT is what I mean by "no #ifdef's in code" - that the selection is done
> > at a higher level, the same way we have good interfaces with clear
> > *conceptual* meaning for all the other PTE accessing stuff, rather than
> > have conditionals in the architecture-independent code.
>
> OK, that gets around the "duplicate vm_normal_page everywhere" issue I
> had. I'm still not quite happy with it ;)
>
> How about taking a different approach, and also having a pte_normal()
> function? Each architecture that has a pte special bit would make this
> !pte_special, and those that don't would return 0. They return 0 from both
> pte_special and pte_normal because they don't know whether the pte is
> special or normal.
>
> Then vm_normal_page would become:
>
> 	if (pte_special(pte))
> 		return NULL;
> 	else if (pte_normal(pte))
> 		return pte_page(pte);
>
> 	... /* vma based scheme */

Hmm, it's not *quite* as trivial as that for one important case:
vm_insert_mixed.
Because we don't actually have a pte yet, we can't easily reuse
insert_page / insert_pfn; rather, we have to build the pte first and then
check it (patch attached, but I think it is a step backwards)...

Really, I don't think either of my two approaches or your approach is a
fundamentally different _abstraction_. It basically just has to accommodate
two different code paths no matter how you look at it. I don't know how
this is different from, say, conditionally compiling e.g. the
FLATMEM/SPARSEMEM memory model code, or the rwsem code, depending on
whether an architecture has defined some symbol. It happens all over the
kernel. Actually, I'd argue it is _better_ than that, because the logic
stays in one place (one screenful, even), and away from abuse or divergence
by arch code.

If one actually came up with a new API that handles both cases better, I'd
say that is a different abstraction. Or if you could come up with some
different arch functions which would allow vm_normal_page to be streamlined
to read more like a regular C function, that would be a different
abstraction...

I'm still keen on my first patch. I know it isn't beautiful, but I think it
is better than the alternatives.

---
mm: add vm_insert_mixed

vm_insert_mixed will insert either a raw pfn or a refcounted struct page
into the page tables, depending on whether vm_normal_page() will return
the page or not. With the introduction of the new pte bit, this is now a
bit too tricky for drivers to be doing themselves.

filemap_xip uses this in a subsequent patch.
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1097,6 +1097,8 @@ int remap_pfn_range(struct vm_area_struc
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn);
 
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
 			unsigned int foll_flags);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1282,6 +1282,53 @@ out:
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int retval;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+
+	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return -EFAULT;
+
+	retval = -ENOMEM;
+	pte = get_locked_pte(mm, addr, &ptl);
+	if (!pte)
+		goto out;
+	retval = -EBUSY;
+	if (!pte_none(*pte))
+		goto out_unlock;
+
+	entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
+	/*
+	 * If we don't have pte special, then we have to use the pfn_valid()
+	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+	 * refcount the page if pfn_valid is true. Otherwise we can *always*
+	 * avoid refcounting the page if we have pte_special.
+	 */
+	if (!pte_special(entry) && pfn_valid(pfn)) {
+		struct page *page;
+
+		page = pfn_to_page(pfn);
+		get_page(page);
+		inc_mm_counter(mm, file_rss);
+		page_add_file_rmap(page);
+	}
+	/* Ok, finally just insert the thing.. */
+	set_pte_at(mm, addr, pte, entry);
+
+	retval = 0;
+out_unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return retval;
+}
+EXPORT_SYMBOL(vm_insert_mixed);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results