I have written a C program that uses two threads. Initially the code was a single loop:

    for (int i = 0; i < n; i++) {
        long_operation(arr[i]);
    }

I then split the loop across two threads that execute concurrently. One thread handles arr[0] to arr[n/2 - 1], and the other handles arr[n/2] to arr[n-1]. The long_operation function is thread-safe.

At first I used join, but it was spending a lot of sys time in the futex system call, which I observed with strace. So I removed the join and instead used two volatile variables, one per thread, to track whether each thread has completed, plus a busy loop in the thread-spawning function to hold back the critical section until both threads are done. I also made the threads detached. This improved performance slightly.

However, when I ran the program under the time command, it reported:

    real    0m31.368s
    user    0m53.738s
    sys     0m15.203s

But when I checked with strace, the output was:

    % time     seconds  usecs/call     calls    errors syscall
     55.79    0.000602           9        66           clone
     44.21    0.000477           3       177           write
    ------ ----------- ----------- --------- --------- ---------------
    100.00    0.001079                   243           total

So time reports that the process spent around 15 seconds of CPU time in the kernel, while strace shows that almost no time went to system calls. Why, then, were 15 seconds spent in the kernel?

I have a dual-core, hyper-threaded Intel CPU and kernel version 4.7.9.
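
The detached-thread setup described above can be sketched roughly as follows. This is a simplified illustration, not the actual program: `process_split`, `struct range`, `worker`, and the body of `long_operation` are placeholder names and code made up for the example. It also assumes POSIX threads and swaps the plain volatile flags for C11 `atomic_int` flags, because volatile alone does not guarantee ordering between threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Placeholder for the real thread-safe work (hypothetical body). */
static void long_operation(long *x) { *x = *x * 2 + 1; }

/* One half of the array plus its completion flag. */
struct range {
    long *arr;
    size_t begin, end;   /* half-open interval [begin, end) */
    atomic_int done;     /* set to 1 when the range is finished */
};

static void *worker(void *p) {
    struct range *r = p;
    for (size_t i = r->begin; i < r->end; i++)
        long_operation(&r->arr[i]);
    /* Publish completion; the struct is not touched after this store. */
    atomic_store_explicit(&r->done, 1, memory_order_release);
    return NULL;
}

/* Runs long_operation over arr[0..n-1] using two detached threads.
 * Returns 0 on success or the pthread error code from attribute setup. */
static int process_split(long *arr, size_t n) {
    assert(arr != NULL || n == 0);

    struct range r[2] = {
        { .arr = arr, .begin = 0,     .end = n / 2 },
        { .arr = arr, .begin = n / 2, .end = n     },
    };
    atomic_init(&r[0].done, 0);
    atomic_init(&r[1].done, 0);

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    for (int k = 0; k < 2; k++) {
        pthread_t tid;
        /* If a thread cannot be created, do that half in this thread. */
        if (pthread_create(&tid, &attr, worker, &r[k]) != 0)
            worker(&r[k]);
    }
    pthread_attr_destroy(&attr);

    /* Busy loop instead of join: spin until both flags are set. */
    while (!atomic_load_explicit(&r[0].done, memory_order_acquire) ||
           !atomic_load_explicit(&r[1].done, memory_order_acquire))
        ;

    return 0;
}
```

Calling `process_split(arr, n)` from `main` and building with `-pthread` should reproduce the overall structure: two `clone` calls per invocation, no join, and a spinning caller that burns user time while it waits.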