Dear Kernel Hackers,

I'm Jason Cai, a kernel developer from Dell EMC. I have hit the same issue that Lennart Sorensen reported on Dec 19, 2016, and I have now narrowed it down.

It seems that an unexpected DNA (Device Not Available) exception may be triggered in the `execve` code path. Specifically, it occurs between `setup_new_exec()` and `start_thread()` in `load_elf_binary()` (fs/binfmt_elf.c).

I've added a BUG_ON() just before `start_thread()` in `load_elf_binary()` to assert that the FPU status in the current process descriptor is clean when performing an exec. It gets triggered, and the output is as follows:

-----------------------------------------------------------------------------
(E3)[ 1517.089157] current is bad: ffff8812227387c0 (abuse)
(E3)[ 1517.089176] prev: fpu=ffff8811d846c100, fpu_src=ffff8817fbab7500, fpu_fork=ffff880bf5513740, fpu_exec= (null)
(E3)[ 1517.089190] has_fpu=1, fpu_counter=1, flags=402000, CR0=80050033
(E0)[ 1517.089223] ------------[ cut here ]------------
(E2)[ 1517.095250] kernel BUG at linux-3.2/fs/binfmt_elf.c:1064!
(U0)(MSG-KERN-00005):[ 1517.106894] invalid opcode: 0000 [#1] SMP
(E4)[ 1517.114030] CPU 23
(E4)[ 1517.117055] Modules linked in: ...
(E4)[ 1517.192079]
(E4)[ 1517.194621] Pid: 29746, comm: abuse Tainted: P O 3.2.33
(E4)[ 1517.207783] RIP: 0010:[<ffffffff81129670>] [<ffffffff81129670>] load_elf_binary+0x1858/0x1983
(E4)[ 1517.218284] RSP: 0018:ffff8817fa15fd08 EFLAGS: 00010292
(E4)[ 1517.225087] RAX: 0000000000000053 RBX: ffff8812227387c0 RCX: 0000000081000000
(E4)[ 1517.233924] RDX: 0000000081000000 RSI: 0000000000000046 RDI: ffffffff81721140
(E4)[ 1517.242761] RBP: ffff8817fa15fe18 R08: 0000000000000000 R09: 000000020fc00000
(E4)[ 1517.251597] R10: ffff88187a15fc17 R11: 0000000000000000 R12: ffff880622e3ef80
(E4)[ 1517.260432] R13: ffff8811c4333400 R14: ffff8812227387c0 R15: ffff8817fa15ff58
(E4)[ 1517.269269] FS: 0000000000000000(0000) GS:ffff88183fd60000(0000) knlGS:0000000000000000
(E4)[ 1517.279169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
(E4)[ 1517.286455] CR2: 00007fbca10dcba8 CR3: 00000011dd8a7000 CR4: 00000000001407e0
(E4)[ 1517.295290] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
(E4)[ 1517.304125] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
(E4)[ 1517.312960] Process abuse (pid: 29746, threadinfo ffff8817fa15e000, task ffff8812227387c0)
(E0)[ 1517.323055] Stack:
(E4)[ 1517.326178] 0000000000000001 00007fffd47a98e8 00007fffd47a9988 ffff881200000008
(E4)[ 1517.335384] ffff880627961680 ffff8812227387c0 ffff8817fa15e000 ffff8817fa15e000
(E4)[ 1517.344586] ffff8817fa15e000 ffff8812227387c0 0000000000500988 0000000000500778
(E0)[ 1517.353798] Call Trace:
(E4)[ 1517.357416] [<ffffffff810ec3a0>] search_binary_handler+0xd6/0x273
(E4)[ 1517.365196] [<ffffffff810edeed>] do_execve_common.clone.28+0x1e1/0x2e8
(E4)[ 1517.373458] [<ffffffff810ee00f>] do_execve+0x1b/0x1d
(E4)[ 1517.379975] [<ffffffff810092b1>] sys_execve+0x49/0xe1
(E4)[ 1517.386589] [<ffffffff813a4b4c>] stub_execve+0x6c/0xc0
(E0)[ 1517.393293] Code: 81 31 c0 e8 c3 27 f1 ff 41 0f 20 c0 48 c7 c7 f0 49 51 81 8b 4b 14 0f b6 93 b8 01 00 00 48 8b b3 d8 04 00 00 31 c0 e8 a0 27 f1 ff <0f>
0b 49 8b 95 98 00 00 00 48 8b 75 b8 4c 89 ff e8 ba 7d ed ff
(U1)(MSG-KERN-00005):[ 1517.416621] RIP [<ffffffff81129670>] load_elf_binary+0x1858/0x1983
(E4)[ 1517.426164] RSP <ffff8817fa15fd08>
(E4)[ 1517.430961] ---[ end trace 5dcaec314d0b0edb ]---
(U0)(MSG-KERN-00018):[ 1517.436994] Kernel panic - not syncing: Fatal exception
(E4)[ 1517.445346] Pid: 29746, comm: abuse Tainted: P D O 3.2.33
(E4)[ 1517.454276] Call Trace:
(E4)[ 1517.457893] [<ffffffff8139af77>] panic+0xb2/0x1d2
(E4)[ 1517.464122] [<ffffffff8103c75a>] ? kmsg_dump+0x5d/0xdf
(E4)[ 1517.470825] [<ffffffff8139eb8a>] oops_end+0xae/0xbe
(E4)[ 1517.477246] [<ffffffff81004b81>] die+0x5a/0x65
(E4)[ 1517.483185] [<ffffffff8139e6b8>] do_trap+0x121/0x130
(E4)[ 1517.489703] [<ffffffff81002a27>] do_invalid_op+0x96/0x9f
(E4)[ 1517.496601] [<ffffffff81129670>] ? load_elf_binary+0x1858/0x1983
(E4)[ 1517.504280] [<ffffffff813a63f5>] invalid_op+0x15/0x20
(E4)[ 1517.510893] [<ffffffff81129670>] ? load_elf_binary+0x1858/0x1983
(E4)[ 1517.518575] [<ffffffff81129670>] ? load_elf_binary+0x1858/0x1983
(E4)[ 1517.526257] [<ffffffff810ec3a0>] search_binary_handler+0xd6/0x273
(E4)[ 1517.534035] [<ffffffff810edeed>] do_execve_common.clone.28+0x1e1/0x2e8
(E4)[ 1517.542289] [<ffffffff810ee00f>] do_execve+0x1b/0x1d
(E4)[ 1517.548810] [<ffffffff810092b1>] sys_execve+0x49/0xe1
(E4)[ 1517.555427] [<ffffffff813a4b4c>] stub_execve+0x6c/0xc0
--------------------------------------------------------------------------------------------------

The kernel code I'm testing is the same as the stable branch linux-3.2.y. AFAIK, there are no FPU instructions between `setup_new_exec()` and `start_thread()` in `load_elf_binary()`.

The BUG_ON() code is as follows:

--------------------------------------------------------------------------------------------------
if ((current->thread.has_fpu) || current->fpu_counter || tsk_used_math(current)) {
	// printk some status related to FPU ...
	BUG_ON(1);
}
--------------------------------------------------------------------------------------------------

Maybe a quick fix is simply not to free the FPU state in `start_thread_common()`.

Last but not least, so far this issue has only been seen on systems equipped with Intel E5-2620v3 and E7-4880v2 CPUs. Thus, I'm still wondering whether it could be a CPU issue or something else. How can I verify it?

I would greatly appreciate any feedback.

Best regards,
Jason Cai
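
For reference, the status line in the dump (`has_fpu=1, fpu_counter=1, flags=402000, CR0=80050033`) can be decoded mechanically. The sketch below is only an illustration under stated assumptions: it assumes that `flags=` is `current->flags` printed in hex, it copies the `PF_USED_MATH` and `PF_RANDOMIZE` values from `include/linux/sched.h` as found in linux-3.2.y, and the helper names (`cr0_ts_set`, `task_used_math`, `check_logged_values`) are made up for this example and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR0 bit: Task Switched. While TS is set, the next
 * FPU/SSE instruction raises #NM (Device Not Available). */
#define X86_CR0_TS   UINT64_C(0x00000008)

/* Task flag values as defined in include/linux/sched.h (linux-3.2.y). */
#define PF_USED_MATH UINT64_C(0x00002000)
#define PF_RANDOMIZE UINT64_C(0x00400000)

/* Hypothetical helper: true if CR0.TS is set in the given CR0 value. */
static bool cr0_ts_set(uint64_t cr0)
{
	return (cr0 & X86_CR0_TS) != 0;
}

/* Hypothetical helper: mirrors what tsk_used_math() tests, assuming
 * the argument is the task's flags word. */
static bool task_used_math(uint64_t task_flags)
{
	return (task_flags & PF_USED_MATH) != 0;
}

/* Check the two values quoted from the log above. */
static void check_logged_values(void)
{
	const uint64_t cr0   = UINT64_C(0x80050033); /* CR0=80050033 */
	const uint64_t flags = UINT64_C(0x402000);   /* flags=402000 */

	/* TS is clear, so the FPU is currently usable without a trap. */
	assert(!cr0_ts_set(cr0));
	/* PF_USED_MATH is set, so tsk_used_math(current) is true. */
	assert(task_used_math(flags));
	/* No task flags other than these two are set. */
	assert((flags & ~(PF_USED_MATH | PF_RANDOMIZE)) == 0);
}
```

Calling `check_logged_values()` from a `main()` passes silently. Read this way, CR0.TS is clear and `PF_USED_MATH` is set, which agrees with `has_fpu=1` and is what one would expect to see if the task had taken ownership of the FPU again at some point after the exec path reset its state.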