On Fri, Jan 18, 2008 at 11:46:22PM +0100, Nick Piggin wrote:
> On Fri, Jan 18, 2008 at 08:41:22AM -0800, Linus Torvalds wrote:
> >
> > Then, just have a library version of the long form, and make architectures
> > that don't support it just use that (just so that you don't have to
> > duplicate that silly thing). So an architecture that supports special page
> > flags would do something like
> >
> > 	#define set_special_page(vma,addr,ptep,pfn,prot) \
> > 		set_pte_at(vma, addr, ptep, mk_special_pte(pfn,prot))
> > 	#define vm_normal_page(vma,addr,pte) \
> > 		(pte_special(pte) ? NULL : pte_page(pte))
> >
> > and other architectures would just do
> >
> > 	#define set_special_page(vma,addr,ptep,pfn,prot) \
> > 		set_pte_at(vma, addr, ptep, mk_pte(pfn,prot))
> > 	#define vm_normal_page(vma,addr,pte) \
> > 		generic_vm_normal_page(vma,addr,pte)
> >
> > or something.
> >
> > THAT is what I mean by "no #ifdef's in code" - that the selection is done
> > at a higher level, the same way we have good interfaces with clear
> > *conceptual* meaning for all the other PTE accessing stuff, rather than
> > have conditionals in the architecture-independent code.
>
> OK, that gets around the "duplicate vm_normal_page everywhere" issue I
> had. I'm still not quite happy with it ;)
>
> How about taking a different approach, and also having a pte_normal()
> function? Each architecture that has a pte special bit would make this
> !pte_special, and those that don't would return 0. They return 0 from both
> pte_special and pte_normal because they don't know whether the pte is
> special or normal.
>
> Then vm_normal_page would become:
>
> 	if (pte_special(pte))
> 		return NULL;
> 	else if (pte_normal(pte))
> 		return pte_page(pte);
>
> 	... /* vma based scheme */

Hmm, it's not *quite* as trivial as that for one important case:
vm_insert_mixed.
Because we don't actually have a pte yet, we can't easily reuse
insert_page / insert_pfn; rather, we have to build the pte first and then
check it (patch attached, but I think it is a step backwards)...

Really, I don't think either of my two approaches or your approach is a
fundamentally different _abstraction_. It basically just has to accommodate
two different code paths no matter how you look at it. I don't know how
this is different from, say, conditionally compiling e.g. the
FLATMEM/SPARSEMEM memory model code, or the rwsem code, depending on
whether an architecture has defined some symbol. It happens all over the
kernel. Actually, I'd argue it is _better_ than that, because the logic
stays in one place (one screenful, even), and away from abuse or divergence
by arch code.

If one actually came up with a new API that handles both cases better, I'd
say that is a different abstraction. Or if you could come up with some
different arch functions which would allow vm_normal_page to be streamlined
to read more like a regular C function, that would be a different
abstraction...

I'm still keen on my first patch. I know it isn't beautiful, but I think it
is better than the alternatives.

---
mm: add vm_insert_mixed

vm_insert_mixed will insert either a raw pfn or a refcounted struct page
into the page tables, depending on whether vm_normal_page() will return
the page or not. With the introduction of the new pte bit, this is now a
bit too tricky for drivers to be doing themselves.

filemap_xip uses this in a subsequent patch.
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1097,6 +1097,8 @@ int remap_pfn_range(struct vm_area_struc
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn);
 
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
 			unsigned int foll_flags);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1282,6 +1282,53 @@ out:
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int retval;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+
+	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return -EFAULT;
+
+	retval = -ENOMEM;
+	pte = get_locked_pte(mm, addr, &ptl);
+	if (!pte)
+		goto out;
+	retval = -EBUSY;
+	if (!pte_none(*pte))
+		goto out_unlock;
+
+	entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
+	/*
+	 * If we don't have pte special, then we have to use the pfn_valid()
+	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+	 * refcount the page if pfn_valid is true. Otherwise we can *always*
+	 * avoid refcounting the page if we have pte_special.
+	 */
+	if (!pte_special(entry) && pfn_valid(pfn)) {
+		struct page *page;
+
+		page = pfn_to_page(pfn);
+		get_page(page);
+		inc_mm_counter(mm, file_rss);
+		page_add_file_rmap(page);
+	}
+	/* Ok, finally just insert the thing.. */
+	set_pte_at(mm, addr, pte, entry);
+
+	retval = 0;
+out_unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return retval;
+}
+EXPORT_SYMBOL(vm_insert_mixed);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results