Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit

Nick Piggin <npiggin@xxxxxxx> · Wed, 16 Jan 2008 04:38:00 +0100

On Sun, Jan 13, 2008 at 08:50:23AM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 13 Jan 2008, Nick Piggin wrote:
> > 
> > OK, you have to read to the 3rd paragraph.
> 
> That one doesn't actually say what the point is either. It just claims 
> that there *is* a point that isn't actually explained or mentioned 
> further.

OK, right after reading it again I concede it is unclear and doesn't
actually make the point.

My biggest concen is that s390 basically are thinking about doing
something like this:

       if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                if (vma->vm_flags & VM_MIXEDMAP) {
#ifdef s390
			if (pte_special(pte))
				return NULL;
#else
                        if (!pfn_valid(pfn))
                                return NULL;
#endif
                        goto out;
                } else {
                        unsigned long off = (addr-vma->vm_start) >> PAGE_SHIFT;
                        if (pfn == vma->vm_pgoff + off)
                                return NULL;
                        if (!is_cow_mapping(vma->vm_flags))
                                return NULL;
                }
        }

Of course it is nicely wrapped in arch functions, but that is the
basic logic. Now what happens is that already gives us 2 paths through
that function which needs to be understood by developers, but it also
IMO is more confusing because it mixes the 2 schemes (vma based vs pte
based).

Basically what I want to do is make the pte_special part usable by
all architectures, and bring it out of the vma tests. I think if anything
it actually makes the code clearer to a novice reader because they can
quite easily see the intention of the more complex vma scheme, by
understanding what the pte scheme does (althogh the comments are pretty
good at doing that anyway). My point is just that I don't think it is
too horrible. It isn't like the poor developer who has to understand
the vma scheme will have much problem with the pte scheme...

> > Well the immediate improvement from this actual patch is just that
> > it gives better and smaller code for vm_normal_page (even if you
> > discount the debug checks in the existing code).
> 
> It does no such thing, except for the slow-path that doesn't matter.

Well it does do such a thing.

Here is the old one (with the sanity check ifdefed away). 88 bytes or so,
somewhat sparse in icache for the fastpath, which also has a short forward
jump:

vm_normal_page:
        movabsq $70368744177663, %rax   #, tmp66
        andq    %rax, %rdx      # tmp66, pfn
        movq    40(%rdi), %rax  # <variable>.vm_flags, D.23528
        shrq    $12, %rdx       #, pfn
        testl   $268436480, %eax        #, D.23528
        je      .L14    #,
        testl   $268435456, %eax        #, D.23528
        je      .L16    #,
        cmpq    end_pfn(%rip), %rdx     # end_pfn, pfn
        jb      .L14    #,
        jmp     .L18    #
.L16:
        subq    8(%rdi), %rsi   # <variable>.vm_start, addr
        shrq    $12, %rsi       #, addr
        addq    136(%rdi), %rsi # <variable>.vm_pgoff, addr
        cmpq    %rsi, %rdx      # addr, pfn
        je      .L18    #,
        andl    $40, %eax       #, tmp72
        cmpl    $32, %eax       #, tmp72
        jne     .L18    #,
.L14:
        imulq   $56, %rdx, %rax #, pfn, D.23534
        addq    mem_map(%rip), %rax     # mem_map, D.23534
        ret
.L18:
        xorl    %eax, %eax      # D.23534
        ret

The new one looks like this. It has no taken branches in the fastpath so
is dense in icache, is less than half the size, and works out of registers.
vm_normal_page:
        xorl    %eax, %eax      # D.23526
        testb   $2, %dh #, pte$pte
        jne     .L16    #,
        movabsq $70368744177663, %rax   #, tmp66
        andq    %rax, %rdx      # tmp66, pte$pte
        shrq    $12, %rdx       #, pte$pte
        imulq   $56, %rdx, %rax #, pte$pte, D.23526
        addq    mem_map(%rip), %rax     # mem_map, D.23526
.L16:
        ret

I think it is slightly better code even for the fastpath.

> So it may optimize the slow-path (PFNMAP/MIXMAP), but the real path stays 
> exactly the same from a code generation standpoint (just checking a 
> different bit, afaik). 

Well, possibly MIXMAP will become an important path too if you have a lot
of use of these XIP filesystems.

> If those pte_special bits are required for unexplained lockless 
> get_user_pages, is this going to get EVEN WORSE when s390 (again) cannot 
> do it?

Ah, my raising of the lockless get_user_pages was just to demonstrate
that some other architectures (like x86) would also like to have a
pte_special bit in their pte. So it isn't like we'd be using a pte bit
_simply_ to make this little improvement to vm_normal_page(), just that
the path isn't going to be an s390 only thing because we may get other
architectures with a pte_special bit for other good reasons.

lockless get_user_pages won't touch vm_normal_page, or even the existing
get_user_pages. Architectures that don't support it would just fall
through to regular get_user_pages...

-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html