On 01/12/2015 04:31 PM, Ed Swierk wrote:
I'm trying to track down a strange problem affecting an O32 userspace program on an N64 MIPS kernel. I'm using a 64-bit kernel from Cavium that's nominally 3.10.20 but has an assortment of patches, including rt and grsec and various Cavium stuff.
If you have all that stuff, you should also have access to the OCTEON simulator, which can produce an instruction trace...
If you can create a testcase that fails within fewer than 10^9 instructions on the simulator (on say fewer than 4 CPUs), then it would be child's play to find the problem...
David Daney
My glibc is stock 2.19-13 from Debian. The program is written in Go and compiled with gccgo from gcc 4.9.2. It uses the exec.Command API, which is a Go wrapper for fork(), exec(), wait() and friends.

The runtime library (libgo) was changed sometime before 4.9.2 to call clone() rather than fork() (see http://patchwork.ozlabs.org/patch/386411/). Presumably for expediency, the library invokes clone() indirectly via syscall(). Complicating matters, the clone() calls are made from different threads, so the program also has to handle SIGCHLD whenever one of its child processes exits.

Most of the time, the indirect clone() call works just fine. Occasionally, however, the clone() gets interrupted by a signal. When the signal handler returns, the kernel tries to restart the clone() syscall by rolling back the program counter and various registers, and jumping back into userspace at the point where the syscall was originally made.

Running my program under strace looks like this (minus noise from other processes/threads):

  2530 syscall(0x1018, 0x12, 0, 0, 0, 0, 0 <unfinished ...>
  2532 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3113, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
  2530 <... syscall resumed> ) = ? ERESTARTNOINTR (To be restarted)
  2530 syscall(0x12, 0, 0, 0, 0, 0, 0) = -1 ENOSYS (Function not implemented)
  2532 rt_sigreturn( <unfinished ...>
  2532 <... rt_sigreturn resumed> ) = 2

syscall(0x1018) (where 0x1018 is the syscall number for clone on 32-bit MIPS) first returns ERESTARTNOINTR (as expected, this never actually propagates back to userspace). But the next attempt uses syscall number 0x12, which returns ENOSYS because there's no such syscall. I assume it is no coincidence that 0x12 is the first argument to the original syscall.
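A toy model of one possible explanation for the trace above, offered only as a sketch and not as a description of the actual kernel code. It relies on three facts: O32 syscall numbers start at 4000, number 4000 itself is the indirect syscall(), and SIGCHLD is 18 (0x12) on MIPS. The assumption being tested is that the indirect entry path shifts the arguments left by one, and that the restart then re-issues the indirect syscall with the already-shifted arguments. The names enter() and interrupted_call() are hypothetical.

```python
# Toy model only: illustrative names, not kernel code.
O32_BASE = 4000              # O32 syscall numbers start here
NR_SYSCALL = O32_BASE        # 4000: the indirect syscall(number, args...)
NR_CLONE = O32_BASE + 120    # 4120 == 0x1018, as seen in the strace output
SIGCHLD = 18                 # 0x12 on MIPS; the flags of a fork-like clone()


def enter(v0, args):
    """Return (syscall actually run, argument registers after entry).

    ASSUMPTION: for the indirect syscall, the real number is taken from
    the first argument and the remaining arguments are shifted left.
    """
    if v0 == NR_SYSCALL:
        return args[0], args[1:] + [0]
    return v0, list(args)


def result(target):
    """Anything below the O32 base is not an O32 syscall number."""
    return "ok" if target >= O32_BASE else "ENOSYS"


def interrupted_call(v0, args):
    """First attempt is interrupted; the restart reuses v0 and the
    argument registers as they were left by the first attempt."""
    _, args_after = enter(v0, args)       # first attempt: ERESTARTNOINTR
    target, _ = enter(v0, args_after)     # restarted attempt
    return target, args_after, result(target)


# Indirect clone, as issued by libgo via syscall():
print(interrupted_call(NR_SYSCALL, [NR_CLONE, SIGCHLD, 0, 0, 0, 0, 0]))
# Direct clone, as issued by fork() in glibc:
print(interrupted_call(NR_CLONE, [SIGCHLD, 0, 0, 0, 0, 0, 0]))
```

Under this assumption the restarted indirect call ends up dispatching on 0x12 with all-zero remaining arguments, which has the same shape as the second syscall() line in the trace, while a direct clone() restarts with its number and arguments unchanged.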
For comparison, when I compile my program against the original libgo, which calls fork(), and run it under strace, I see the following:

  16791 clone( <unfinished ...>
  16792 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=17006, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
  16792 rt_sigreturn( <unfinished ...>
  16791 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x3bc843c8) = ? ERESTARTNOINTR (To be restarted)
  16792 <... rt_sigreturn resumed> ) = 0
  16791 clone( <unfinished ...>
  16791 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x3bc843c8) = 17008

Note that the second call to clone() has exactly the same arguments as the first one, and returns the new PID as expected.

I spent quite some time digging into the syscall code in the kernel and glibc, but couldn't figure out which of them is supposed to shift the arguments and decide which go on the stack and which go in registers.

I tried the same experiment with a 64-bit little-endian userspace from the Debian mips64el repository and a gcc 4.9.2 toolchain targeting mips64el. The program works fine. So the problem appears limited to O32 userspace on an N64 kernel (it is not clear whether endianness is an issue).

I can prepare a self-contained test case, but thought I'd first ask if this symptom rings a bell with anyone on the list.

--Ed