On Sat, Jun 13, 2020 at 2:04 PM afzal mohammed <afzal.mohd.ma@xxxxxxxxx> wrote:
> On Fri, Jun 12, 2020 at 10:07:28PM +0200, Arnd Bergmann wrote:
>
> > I think a lot
> > of usercopy calls are only for a few bytes, though this is of course
> > highly workload dependent and you might only care about the large
> > ones.
>
> Observation is that max. pages reaching copy_{from,to}_user() is 2,
> observed maximum of n (number of bytes) being 1 page size. i think C
> library cuts any size read, write to page size (if it exceeds) &
> invokes the system call. Max. pages reaching 2, happens when 'n'
> crosses page boundary, this has been observed w/ small size request
> as well w/ ones of exact page size (but not page aligned).

Right, this is apparently because tmpfs uses shmem_file_read_iter()
to copy the file pages one at a time. generic_file_buffered_read()
seems similar, so copying between an aligned kernel page and an
address in user space that is not page aligned would be an important
case to optimize for.

> Quickly comparing boot-time on Beagle Bone White, boot time increases
> by only 4%, perhaps this worry is irrelevant, but just thought will
> put it across.

A 4% boot time increase sounds like a lot, especially if that is only
from copy_from_user/copy_to_user. In the end it really depends on how
well get_user()/put_user() and small copies can be optimized.

> > There is also still hope of optimizing small aligned copies like
> >
> >        set_ttbr0(user_ttbr);
> >        ldm();
> >        set_ttbr0(kernel_ttbr);
> >        stm();
>
> Hmm, more needs to be done to be in a position to test it.

This is going to be highly microarchitecture specific, so anything
you test on the Beaglebone's Cortex-A8 might not apply to A7/A15/A17
systems, but if you want to test what the overhead is, you could try
changing /dev/zero (or a different chardev like it) to use a series
of put_user(0, u32uptr++) in place of whatever it has, and then
replace the 'str' instruction with dummy writes to ttbr0 using the
value it already has, like:

        mcr     p15, 0, %0, c2, c0, 0   /* set_ttbr0() */
        isb                             /* prevent speculative access to kernel table */
        str     %1, [%2], #0            /* write 32 bit to user space */
        mcr     p15, 0, %0, c2, c0, 0   /* set_ttbr0() */
        isb                             /* prevent speculative access to user table */

This is obviously going to be very slow compared to the simple store
there is today, but maybe cheaper than the CONFIG_ARM64_SW_TTBR0_PAN
uaccess_en/disable() on arm64 on a single get_user()/put_user().

It would be interesting to compare it to the overhead of a
get_user_pages_fast() based implementation.

From the numbers you measured, it seems the beaglebone currently
needs an extra ~6µs or 3µs per copy_to/from_user() call with your
patch, depending on what your benchmark was (MB/s for just reading or
writing vs MB/s for copying from one file to another through a user
space buffer).

      Arnd
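
As a rough illustration of the /dev/zero experiment described above,
a read() handler for a test chardev could fill the user buffer with a
series of put_user(0, u32uptr++) calls, so that the per-access cost
(e.g. the dummy ttbr0 writes patched into the store) dominates the
measurement. This is only a sketch under stated assumptions: the
function name and its file_operations wiring are made up for the
example, and the tail handling is the bare minimum needed for a
microbenchmark.

        #include <linux/fs.h>
        #include <linux/uaccess.h>

        /*
         * Sketch only: zero the user buffer one 32-bit put_user() at
         * a time instead of a bulk copy, so each access goes through
         * the (modified) single-word uaccess path being measured.
         */
        static ssize_t zero_put_user_read(struct file *file, char __user *buf,
                                          size_t count, loff_t *ppos)
        {
                u32 __user *u32uptr = (u32 __user *)buf;
                size_t n;

                for (n = 0; n + sizeof(u32) <= count; n += sizeof(u32)) {
                        if (put_user(0, u32uptr++))
                                return -EFAULT;
                }

                /* ignore any unaligned tail, it does not matter for timing */
                return n;
        }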