On Fri, Jun 12, 2020 at 12:18 PM afzal mohammed <afzal.mohd.ma@xxxxxxxxx> wrote: > > copy_{from,to}_user() uaccess helpers are implemented by user page > pinning, followed by temporary kernel mapping & then memcpy(). This > helps to achieve user page copy when current virtual address mapping > of the CPU excludes user pages. > > Performance wise, results are not encouraging, 'dd' on tmpfs results, > > ARM Cortex-A8, BeagleBone White (256MiB RAM): > w/o series - ~29.5 MB/s > w/ series - ~20.5 MB/s > w/ series & highmem disabled - ~21.2 MB/s > > On Cortex-A15(2GiB RAM) in QEMU: > w/o series - ~4 MB/s > w/ series - ~2.6 MB/s > > Roughly a one-third drop in performance. Disabling highmem improves > performance only slightly. > > 'hackbench' also showed a similar pattern. > > uaccess routines using page pinning & temporary kernel mapping is not > something new, it has been done long long ago by Ingo [1] as part of > 4G/4G user/kernel mapping implementation on x86, though not merged in > mainline. > > [1] https://lore.kernel.org/lkml/Pine.LNX.4.44.0307082332450.17252-100000@localhost.localdomain/ Nice changelog text! I agree the performance drop is not great. There are probably some things that can be done to optimize it, but I guess most of the overhead is from the page table operations and cannot be avoided. What was the exact 'dd' command you used, in particular the block size? Note that by default, 'dd' will request 512 bytes at a time, so you usually only access a single page. It would be interesting to see the overhead with other typical or extreme block sizes, e.g. '1', '64', '4K', '64K' or '1M'. If you want to drill down into where exactly the overhead is (i.e. get_user_pages or kmap_atomic, or something different), using 'perf record dd ..', and 'perf report' may be helpful. > +static int copy_chunk_from_user(unsigned long from, int len, void *arg) > +{ > + unsigned long *to_ptr = arg, to = *to_ptr; > + > + memcpy((void *) to, (void *) from, len); > + *to_ptr += len; > + return 0; > +} > + > +static int copy_chunk_to_user(unsigned long to, int len, void *arg) > +{ > + unsigned long *from_ptr = arg, from = *from_ptr; > + > + memcpy((void *) to, (void *) from, len); > + *from_ptr += len; > + return 0; > +} Will gcc optimize away the indirect function call and inline everything? If not, that would be a small part of the overhead. > +unsigned long gup_kmap_copy_from_user(void *to, const void __user *from, unsigned long n) > +{ > + struct page **pages; > + int num_pages, ret, i; > + > + if (uaccess_kernel()) { > + memcpy(to, (__force void *)from, n); > + return 0; > + } > + > + num_pages = DIV_ROUND_UP((unsigned long)from + n, PAGE_SIZE) - > + (unsigned long)from / PAGE_SIZE; Make sure this doesn't turn into actual division operations but uses shifts. It might even be clearer here to open-code the shift operation so readers can see what this is meant to compile into. > + pages = kmalloc_array(num_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO); > + if (!pages) > + goto end; Another micro-optimization may be to avoid the kmalloc for the common case, e.g. anything with "num_pages <= 64", using an array on the stack. > + ret = get_user_pages_fast((unsigned long)from, num_pages, 0, pages); > + if (ret < 0) > + goto free_pages; > + > + if (ret != num_pages) { > + num_pages = ret; > + goto put_pages; > + } I think this is technically incorrect: if get_user_pages_fast() only gets some of the pages, you should continue with the short buffer and return the number of remaining bytes rather than not copying anything. I think you did that correctly for a failed kmap_atomic(), but this has to use the same logic. Arnd