Hi Catalin, David, all

> COMPAT_SYSCALL_WRAP2(creat, ...):
> mov w0, w0
> b <sys_creat>
>
> > > Cost wise, this seems like it all cancels out in the end, but what
> > > do I know?
> >
> > I think you know something, and I also think Heiko and other s390 guys
> > know something as well. So I'd like to listen their arguments here.
> >
> > For me spark64 way is looking reasonable only because it's really simple
> > and takes less coding. I'll try it on some branch and share here what happened.
>
> The kernel code will definitely look simpler ;). It would be good to see
> if there actually is any performance impact. Even with 16 more cycles on
> syscall entry, would they be lost in the noise? You don't need a full
> implementation, just some dummy mov x0, x0 on the entry path.
>
> --
> Catalin

I wrote a simple test:

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <sys/time.h>

	struct timeval start, end;
	unsigned long long ut;

	int main()
	{
		gettimeofday(&start, NULL);
		/* 'who' == 100 is invalid, so each call returns -EINVAL early */
		for (int i = 1000000; i; i--)
			syscall(__NR_getrusage, 100 /* EINVAL */, NULL);
		gettimeofday(&end, NULL);
		/* elapsed time in microseconds */
		ut = (end.tv_sec - start.tv_sec) * 1000000ULL +
			end.tv_usec - start.tv_usec;
		printf("%llu\n", ut);
		exit(EXIT_SUCCESS);
	}

On the kernel side, the added overhead is minimal:

diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..003d5ad 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1634,6 +1634,17 @@ COMPAT_SYSCALL_DEFINE2(getrusage, int, who, struct compat_rusage __user *, ru)
 {
 	struct rusage r;
 
+	asm volatile (
+	"	mov w0, w0	\n"
+	"	mov w1, w1	\n"
+	"	mov w2, w2	\n"
+	"	mov w3, w3	\n"
+	"	mov w4, w4	\n"
+	"	mov w5, w5	\n"
+	"	mov w6, w6	\n"
+	"	mov w7, w7	\n"
+	);
+
 	if (who != RUSAGE_SELF && who != RUSAGE_CHILDREN &&
 	    who != RUSAGE_THREAD)
 		return -EINVAL;

On QEMU (microseconds per 1,000,000 calls):

	With MOVs:	W/O MOVs:
	832015		814564
	840639		803165
	830482		813116
	832895		802928
	832083		832658
	834461		802993
	829405		812465
	846677		822651
	828409		803393
	836845		821470
	828716		801044
	831620		821301
	825423		800278
	829946		821476

On average that is about 833 ms vs 812 ms per million calls, roughly
2.5% of performance degradation.
I could show a bigger relative difference by using a raw svc
instruction instead of the syscall() wrapper, since the wrapper adds to
the measured time as well. The overhead is definitely more than zero,
but not that big either. For syscalls with a heavy payload it will not
be measurable.

So the choice is still there: should we use wrappers and save 2.5% of
syscall performance, or clear the top halves unconditionally and win in
simplicity?

If QEMU looks non-representative, I can run the test on real hardware,
but that takes time, and I think it will end up with similar results.

The latest kernel with wrappers and the library are here:
https://github.com/norov/linux/commits/ilp32
https://github.com/norov/glibc/commits/ilp32-dev

BTW, notice the change in ABI: syscalls that take stat and statfs
structures are now routed to (wrapped) native handlers, after switching
userspace to use 64-bit off_t, ino_t, blkcnt_t, fsblkcnt_t and
fsfilcnt_t types.

Yury.