On Tue, Oct 17, 2017 at 09:58:59AM -0400, Boris Ostrovsky wrote:
> On 10/17/2017 01:24 AM, Josh Poimboeuf wrote:
> > On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
> >> On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> >>> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
> >>>> On 12/10/17 20:11, Boris Ostrovsky wrote:
> >>>>> There is also another problem:
> >>>>>
> >>>>> [ 1.312425] general protection fault: 0000 [#1] SMP
> >>>>> [ 1.312901] Modules linked in:
> >>>>> [ 1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
> >>>>> [ 1.313878] task: ffff88003e2c0000 task.stack: ffffc9000038c000
> >>>>> [ 1.314360] RIP: 10000e030:entry_SYSCALL_64_fastpath+0x1/0xa5
> >>>>> [ 1.314854] RSP: e02b:ffffc9000038ff50 EFLAGS: 00010046
> >>>>> [ 1.315336] RAX: 000000000000000c RBX: 000055f550168040 RCX:
> >>>>> 00007fcfc959f59a
> >>>>> [ 1.315827] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> >>>>> 0000000000000000
> >>>>> [ 1.316315] RBP: 000000000000000a R08: 000000000000037f R09:
> >>>>> 0000000000000064
> >>>>> [ 1.316805] R10: 000000001f89cbf5 R11: ffff88003e2c0000 R12:
> >>>>> 00007fcfc958ad60
> >>>>> [ 1.317300] R13: 0000000000000000 R14: 000055f550185954 R15:
> >>>>> 0000000000001000
> >>>>> [ 1.317801] FS: 0000000000000000(0000) GS:ffff88003f800000(0000)
> >>>>> knlGS:0000000000000000
> >>>>> [ 1.318267] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [ 1.318750] CR2: 00007fcfc97ab218 CR3: 000000003c88e000 CR4:
> >>>>> 0000000000042660
> >>>>> [ 1.319235] Call Trace:
> >>>>> [ 1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48
> >>>>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00
> >>>>> 00 50 <ff> 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
> >>>>> [ 1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: ffffc9000038ff50
> >>>>> [ 1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
> >>>>> [ 1.345009] Kernel panic - not syncing: Attempted to kill init!
> >>>>> exitcode=0x0000000b
> >>>>>
> >>>>>
> >>>>> All code
> >>>>> ========
> >>>>>    0:   51                      push   %rcx
> >>>>>    1:   50                      push   %rax
> >>>>>    2:   57                      push   %rdi
> >>>>>    3:   56                      push   %rsi
> >>>>>    4:   52                      push   %rdx
> >>>>>    5:   51                      push   %rcx
> >>>>>    6:   6a da                   pushq  $0xffffffffffffffda
> >>>>>    8:   41 50                   push   %r8
> >>>>>    a:   41 51                   push   %r9
> >>>>>    c:   41 52                   push   %r10
> >>>>>    e:   41 53                   push   %r11
> >>>>>   10:   48 83 ec 30             sub    $0x30,%rsp
> >>>>>   14:   65 4c 8b 1c 25 c0 d2    mov    %gs:0xd2c0,%r11
> >>>>>   1b:   00 00
> >>>>>   1d:   41 f7 03 df 39 08 90    testl  $0x900839df,(%r11)
> >>>>>   24:   0f 85 a5 00 00 00       jne    0xcf
> >>>>>   2a:   50                      push   %rax
> >>>>>   2b:*  ff 15 9c 95 d0 ff       callq  *-0x2f6a64(%rip)        #
> >>>>> 0xffffffffffd095cd   <-- trapping instruction
> >>>>>   31:   58                      pop    %rax
> >>>>>   32:   48 3d 4c 01 00 00       cmp    $0x14c,%rax
> >>>>>   38:   77 0f                   ja     0x49
> >>>>>   3a:   4c 89 d1                mov    %r10,%rcx
> >>>>>   3d:   ff                      .byte 0xff
> >>>>>   3e:   14 c5                   adc    $0xc5,%al
> >>>>>
> >>>>>
> >>>>> so the original 'cli' was replaced with the pv call but to me the offset
> >>>>> looks a bit off, no? Shouldn't it always be positive?
> >>>> callq takes a 32bit signed displacement, so jumping back by up to 2G is
> >>>> perfectly legitimate.
> >>> Yes, but
> >>>
> >>> ostr@workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> >>> ffffffff817365dd t entry_SYSCALL_64_fastpath
> >>> ostr@workbase> nm vmlinux | grep " pv_irq_ops"
> >>> ffffffff81c2dbc0 D pv_irq_ops
> >>> ostr@workbase>
> >>>
> >>> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> >>> didn't mean that x86 instruction set doesn't allow negative
> >>> displacement, I was trying to say that pv_irq_ops always lives further down)
> >> I believe the problem is this:
> >>
> >> #define PV_INDIRECT(addr) *addr(%rip)
> >>
> >> The displacement that the linker computes will be relative to where
> >> this instruction is placed at the time of linking, which is in
> >> .pv_altinstructions (and not .text). So when we copy it into .text the
> >> displacement becomes bogus.
> > apply_alternatives() is supposed to adjust that displacement based on
> > the new IP, though it could be messing that up somehow. (See patch
> > 10/13.)
> >
>
> That patch doesn't take into account the fact that replacement
> instructions may have to save/restore registers. So, for example,
>
>
> -        if (a->replacementlen && is_jmp(replacement[0]))
> +        } else if (a->replacementlen == 6 && *insnbuf == 0xff &&
> +                   *(insnbuf+1) == 0x15) {
> +            /* indirect call */
> +            *(s32 *)(insnbuf + 2) += replacement - instr;
> +            DPRINTK("Fix indirect CALL offset: 0x%x, CALL *0x%lx",
> +                *(s32 *)(insnbuf + 2),
> +                (unsigned long)instr + *(s32 *)(insnbuf + 2) + 6);
> +
>
> doesn't do the adjustment of
>
>   2a:   50                      push   %rax
>   2b:*  ff 15 9c 95 d0 ff       callq  *-0x2f6a64(%rip)
>   31:   58                      pop    %rax
>
> because insnbuf points to 'push' and not to 'call'.

Ah. I forgot that asm paravirt patches put the saves/restores _in_ the
replacement, whereas in C code they're _outside_ the replacement.

Changing PV_INDIRECT to use absolute addressing would be a decent fix,
but I think that would break the PIE support Thomas Garnier has been
working on.

Maybe we can add a new field to the alternatives entry struct which
specifies the offset to the CALL instruction, so apply_alternatives()
can find it.

-- 
Josh
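
For illustration only, here is a rough user-space sketch of the "offset to
the CALL" idea. It is not kernel code and not the patch under discussion:
fix_indirect_call(), its call_off parameter (standing in for the proposed
new field), slot_addr(), demo_fixup(), and the replacement address are all
hypothetical. It only models how a "call *disp32(%rip)" displacement goes
stale when the bytes are copied, and how the same adjustment as in the
quoted hunk works once it is applied at the CALL rather than at insnbuf[0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned-safe accessors for the 32-bit displacement. */
static int32_t get_s32(const uint8_t *p) { int32_t v; memcpy(&v, p, sizeof(v)); return v; }
static void put_s32(uint8_t *p, int32_t v) { memcpy(p, &v, sizeof(v)); }

/* Address of the pointer slot read by "call *disp32(%rip)" (ff 15 disp32)
 * when its first opcode byte sits at insn_addr: next-insn address + disp. */
static uint64_t slot_addr(uint64_t insn_addr, const uint8_t *insn)
{
	return insn_addr + 6 + (uint64_t)(int64_t)get_s32(insn + 2);
}

/* Hypothetical helper: call_off stands in for the proposed new field. */
static int fix_indirect_call(uint8_t *insnbuf, size_t len, size_t call_off,
			     uint64_t replacement, uint64_t instr)
{
	uint8_t *call;

	if (call_off + 6 > len)
		return -1;
	call = insnbuf + call_off;
	if (call[0] != 0xff || call[1] != 0x15)
		return -1;	/* not an indirect RIP-relative call */
	/* Same adjustment as the quoted hunk, applied at the CALL itself. */
	put_s32(call + 2,
		(int32_t)(get_s32(call + 2) + (int64_t)(replacement - instr)));
	return 0;
}

/* Returns 1 if the patched copy still points at the original slot. */
static int demo_fixup(void)
{
	/* Illustrative addresses; the replacement address is made up. */
	const uint64_t replacement = 0xffffffff81ee0000ULL; /* .pv_altinstructions */
	const uint64_t instr       = 0xffffffff817365ddULL; /* patch site in .text */
	const uint64_t slot        = 0xffffffff81c2dbc0ULL; /* function pointer slot */
	uint8_t insnbuf[8] = { 0x50, 0xff, 0x15, 0, 0, 0, 0, 0x58 }; /* push; call; pop */
	const size_t call_off = 1;

	/* Displacement as computed for the copy living at 'replacement'. */
	put_s32(insnbuf + 3, (int32_t)(int64_t)(slot - (replacement + call_off + 6)));
	assert(slot_addr(replacement + call_off, insnbuf + call_off) == slot);

	/* Copied verbatim to 'instr', the displacement points elsewhere. */
	assert(slot_addr(instr + call_off, insnbuf + call_off) != slot);

	/* Looking at insnbuf[0] finds the push (0x50), so nothing is fixed. */
	assert(fix_indirect_call(insnbuf, sizeof(insnbuf), 0, replacement, instr) == -1);

	if (fix_indirect_call(insnbuf, sizeof(insnbuf), call_off, replacement, instr))
		return 0;
	return slot_addr(instr + call_off, insnbuf + call_off) == slot;
}
```

Calling demo_fixup() from a small main() should return 1. Note that the
delta itself (replacement - instr) does not depend on where the CALL sits
inside the replacement, since both copies shift by the same call_off; only
the location of the displacement bytes changes. The push/call/pop sequence
in the disassembly is 8 bytes long, so it would presumably also fail the
replacementlen == 6 test in the quoted hunk.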