Re: [Xen-devel] Linux 4.19.5 fails to boot as Xen dom0

"Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx> · Thu, 29 Nov 2018 17:22:23 +0300

On Thu, Nov 29, 2018 at 01:35:17PM +0000, Juergen Gross wrote:
> On 29/11/2018 14:26, Kirill A. Shutemov wrote:
> > On Thu, Nov 29, 2018 at 09:41:25AM +0000, Juergen Gross wrote:
> >> On 29/11/2018 02:22, Hans van Kranenburg wrote:
> >>> Hi,
> >>>
> >>> As also seen at:
> >>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914951
> >>>
> >>> Attached there are two serial console output logs. One is starting with
> >>> Xen 4.11 (from debian unstable) as dom0, and the other one without Xen.
> >>>
> >>> [    2.085543] BUG: unable to handle kernel paging request at
> >>> ffff888d9fffc000
> >>> [    2.085610] PGD 200c067 P4D 200c067 PUD 0
> >>> [    2.085674] Oops: 0000 [#1] SMP NOPTI
> >>> [    2.085736] CPU: 1 PID: 1 Comm: swapper/0 Not tainted
> >>> 4.19.0-trunk-amd64 #1 Debian 4.19.5-1~exp1+pvh1
> >>> [    2.085823] Hardware name: HP ProLiant DL360 G7, BIOS P68 05/21/2018
> >>> [    2.085895] RIP: e030:ptdump_walk_pgd_level_core+0x1fd/0x490
> >>> [...]
> >>
> >> The offending stable commit is 4074ca7d8a1832921c865d250bbd08f3441b3657
> >> ("x86/mm: Move LDT remap out of KASLR region on 5-level paging"), this
> >> is commit d52888aa2753e3063a9d3a0c9f72f94aa9809c15 upstream.
> >>
> >> Current upstream kernel is booting fine under Xen, so in general the
> >> patch should be fine. Using an upstream kernel built from above commit
> >> (with the then needed Xen fixup patch 1457d8cf7664f34c4ba534) is fine,
> >> too.
> >>
> >> Kirill, are you aware of any prerequisite patch from 4.20 which could be
> >> missing in 4.19.5?
> > 
> > I'm not.
> > 
> > Let me look into this.
> > 
> 
> What is making me suspicious is the failure happening just after
> releasing the init memory. Maybe there is an access to .init.data
> segment or similar? The native kernel booting could be related to the
> usage of 2M mappings not being available in a PV-domain.

Sounds like a valid hypothesis.
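
For context, the faulting address from the oops can be split into
page-table indices. This is only a rough sketch, assuming 4-level paging
(the oops prints the same value for PGD and P4D) and 4K pages; pt_indices
is a made-up helper, not a kernel function:

```python
# Sketch: split a virtual address into x86-64 page-table indices.
# Assumption: 4-level paging with 4K pages, 9 index bits per level.

FAULT_ADDR = 0xffff888d9fffc000  # from the "unable to handle" line above


def pt_indices(addr):
    """Return the (pgd, pud, pmd, pte) indices for addr."""
    return tuple((addr >> shift) & 0x1ff for shift in (39, 30, 21, 12))


pgd, pud, pmd, pte = pt_indices(FAULT_ADDR)
print(pgd, pud, pmd, pte)
```

The oops reports "PUD 0", so the walk for this address already hits an
empty entry at the PUD level.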

[ 2.085616] Code: 00 00 00 00 40 00 00 49 83 c5 08 48 01 04 24 4c 3b 6c 24 48 0f 84 83 02 00 00 48 8b 04 24 48 c1 f8 10 48 89 84 24 88 00 00 00 <49> 8b 7d 00 48 f7 c7 9f ff ff ff 0f 85 36 ff ff ff 41 b8 03 00 00
All code
========
   0:   00 00                   add    %al,(%rax)
   2:   00 00                   add    %al,(%rax)
   4:   40 00 00                add    %al,(%rax)
   7:   49 83 c5 08             add    $0x8,%r13
   b:   48 01 04 24             add    %rax,(%rsp)
   f:   4c 3b 6c 24 48          cmp    0x48(%rsp),%r13
  14:   0f 84 83 02 00 00       je     0x29d
  1a:   48 8b 04 24             mov    (%rsp),%rax
  1e:   48 c1 f8 10             sar    $0x10,%rax
  22:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
  29:   00
  2a:*  49 8b 7d 00             mov    0x0(%r13),%rdi           <-- trapping instruction
  2e:   48 f7 c7 9f ff ff ff    test   $0xffffffffffffff9f,%rdi
  35:   0f 85 36 ff ff ff       jne    0xffffffffffffff71
  3b:   41                      rex.B
  3c:   b8                      .byte 0xb8
  3d:   03 00                   add    (%rax),%eax
        ...

Code starting with the faulting instruction
===========================================
   0:   49 8b 7d 00             mov    0x0(%r13),%rdi
   4:   48 f7 c7 9f ff ff ff    test   $0xffffffffffffff9f,%rdi
   b:   0f 85 36 ff ff ff       jne    0xffffffffffffff47
  11:   41                      rex.B
  12:   b8                      .byte 0xb8
  13:   03 00                   add    (%rax),%eax
        ...

Reading from %r13 causes the fault.
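
The offset of the trapping instruction can be cross-checked against the
Code: line directly. A minimal sketch, assuming the byte in <..> marks the
first byte at RIP (as in the dump above); trap_offset is a hypothetical
helper, not part of any kernel script:

```python
# Sketch: find the marked byte in an oops "Code:" line and relate it
# to the function offset from the RIP line (+0x1fd).

CODE_LINE = (
    "00 00 00 00 40 00 00 49 83 c5 08 48 01 04 24 4c 3b 6c 24 48 "
    "0f 84 83 02 00 00 48 8b 04 24 48 c1 f8 10 48 89 84 24 88 00 "
    "00 00 <49> 8b 7d 00 48 f7 c7 9f ff ff ff 0f 85 36 ff ff ff "
    "41 b8 03 00 00"
)


def trap_offset(code_line):
    """Return the index of the byte wrapped in <..>."""
    for index, token in enumerate(code_line.split()):
        if token.startswith("<") and token.endswith(">"):
            return index
    raise ValueError("no <..> marker in Code: line")


offset = trap_offset(CODE_LINE)   # bytes dumped before the trapping byte
rip_offset = 0x1fd                # from ptdump_walk_pgd_level_core+0x1fd
dump_start = rip_offset - offset  # function offset where the dump begins
print(hex(offset), hex(dump_start))
```

This gives 0x2a for the marked byte, matching the "2a:*" line in the
disassembly, so the dump starts at +0x1d3 within the function.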

I don't have a setup to reproduce the issue myself, and I'm having a hard
time correlating the code with the source.

Which source line does ptdump_walk_pgd_level_core+0x1fd/0x490 resolve to
in your build?

-- 
 Kirill A. Shutemov