https://bugzilla.kernel.org/show_bug.cgi?id=217562

            Bug ID: 217562
           Summary: kernel NULL pointer dereference on deletion of guest
                    physical memory slot
           Product: Virtualization
           Version: unspecified
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: kvm
          Assignee: virtualization_kvm@xxxxxxxxxxxxxxxxxxxx
          Reporter: arnaud.lefebvre@xxxxxxxxxxxxxxxx
        Regression: No

Created attachment 304438
  --> https://bugzilla.kernel.org/attachment.cgi?id=304438&action=edit
dmesg logs with decoded backtrace

Hello,

We've been hitting this BUG for the last 6 months on both Intel and AMD hosts
without being able to reproduce it on demand; it triggers at random times:

[Mon Jun 12 10:50:08 UTC 2023] BUG: kernel NULL pointer dereference, address: 0000000000000008
[Mon Jun 12 10:50:08 UTC 2023] #PF: supervisor write access in kernel mode
[Mon Jun 12 10:50:08 UTC 2023] #PF: error_code(0x0002) - not-present page
[Mon Jun 12 10:50:08 UTC 2023] PGD 0 P4D 0
[Mon Jun 12 10:50:08 UTC 2023] Oops: 0002 [#1] SMP NOPTI
[Mon Jun 12 10:50:08 UTC 2023] CPU: 88 PID: 856806 Comm: qemu Kdump: loaded Not tainted 5.15.115 #1
[Mon Jun 12 10:50:08 UTC 2023] Hardware name: MCT Capri /Capri , BIOS V2010 04/19/2022
[Mon Jun 12 10:50:08 UTC 2023] RIP: 0010:__handle_changed_spte+0x5f3/0x670
[Mon Jun 12 10:50:08 UTC 2023] Code: b8 a8 00 00 00 e9 4d be 0f 00 4d 8d be 60 6a 01 00 4c 89 44 24 08 4c 89 ff e8 69 30 43 01 4c 8b 44 24 08 49 8b 40 08 49 8b 10 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 00 48 83
[Mon Jun 12 10:50:09 UTC 2023] RSP: 0018:ffffc90029477840 EFLAGS: 00010246
[Mon Jun 12 10:50:09 UTC 2023] RAX: 0000000000000000 RBX: ffff89581a1f6000 RCX: 0000000000000000
[Mon Jun 12 10:50:09 UTC 2023] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffc9002d858a60
[Mon Jun 12 10:50:09 UTC 2023] RBP: 0000000000002200 R08: ffff893450426450 R09: 0000000000000002
[Mon Jun 12 10:50:09 UTC 2023] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[Mon Jun 12 10:50:09 UTC 2023] R13: 00000000000005a0 R14: ffffc9002d842000 R15: ffffc9002d858a60
[Mon Jun 12 10:50:09 UTC 2023] FS: 00007fdb6c1ff6c0(0000) GS:ffff89804d800000(0000) knlGS:0000000000000000
[Mon Jun 12 10:50:09 UTC 2023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun 12 10:50:09 UTC 2023] CR2: 0000000000000008 CR3: 0000005380534005 CR4: 0000000000770ee0
[Mon Jun 12 10:50:09 UTC 2023] PKRU: 55555554
[Mon Jun 12 10:50:09 UTC 2023] Call Trace:
[Mon Jun 12 10:50:09 UTC 2023]  <TASK>
[Mon Jun 12 10:50:09 UTC 2023]  ? __die+0x50/0x8d
[Mon Jun 12 10:50:09 UTC 2023]  ? page_fault_oops+0x184/0x2f0
[Mon Jun 12 10:50:09 UTC 2023]  ? exc_page_fault+0x535/0x7d0
[Mon Jun 12 10:50:09 UTC 2023]  ? asm_exc_page_fault+0x22/0x30
[Mon Jun 12 10:50:09 UTC 2023]  ? __handle_changed_spte+0x5f3/0x670
[Mon Jun 12 10:50:09 UTC 2023]  ? update_load_avg+0x73/0x560
[Mon Jun 12 10:50:09 UTC 2023]  __handle_changed_spte+0x3ae/0x670
[Mon Jun 12 10:50:09 UTC 2023]  __handle_changed_spte+0x3ae/0x670
[Mon Jun 12 10:50:09 UTC 2023]  zap_gfn_range+0x21a/0x320
[Mon Jun 12 10:50:09 UTC 2023]  kvm_tdp_mmu_zap_invalidated_roots+0x50/0xa0
[Mon Jun 12 10:50:09 UTC 2023]  kvm_mmu_zap_all_fast+0x178/0x1b0
[Mon Jun 12 10:50:09 UTC 2023]  kvm_page_track_flush_slot+0x4f/0x90
[Mon Jun 12 10:50:09 UTC 2023]  kvm_set_memslot+0x32b/0x8e0
[Mon Jun 12 10:50:09 UTC 2023]  kvm_delete_memslot+0x58/0x80
[Mon Jun 12 10:50:09 UTC 2023]  __kvm_set_memory_region+0x3c4/0x4a0
[Mon Jun 12 10:50:09 UTC 2023]  kvm_vm_ioctl+0x3d1/0xea0
[Mon Jun 12 10:50:09 UTC 2023]  __x64_sys_ioctl+0x8b/0xc0
[Mon Jun 12 10:50:09 UTC 2023]  do_syscall_64+0x3f/0x90
[Mon Jun 12 10:50:09 UTC 2023]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Mon Jun 12 10:50:09 UTC 2023] RIP: 0033:0x7fdc71e3a5ef
[Mon Jun 12 10:50:09 UTC 2023] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[Mon Jun 12 10:50:09 UTC 2023] RSP: 002b:00007fdb6c1fc920 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Mon Jun 12 10:50:09 UTC 2023] RAX: ffffffffffffffda RBX: 000000004020ae46 RCX: 00007fdc71e3a5ef
[Mon Jun 12 10:50:09 UTC 2023] RDX: 00007fdb6c1fca40 RSI: 000000004020ae46 RDI: 000000000000000d
[Mon Jun 12 10:50:09 UTC 2023] RBP: 00007fdc714fa000 R08: 0000000000000002 R09: 00007fdc6f7e8e10
[Mon Jun 12 10:50:09 UTC 2023] R10: 00007fdc724966e8 R11: 0000000000000246 R12: 00007fdb6c1fca40
[Mon Jun 12 10:50:09 UTC 2023] R13: 0000000001000000 R14: 00007fdb68400000 R15: 00000000fd000000
[Mon Jun 12 10:50:09 UTC 2023]  </TASK>
[Mon Jun 12 10:50:09 UTC 2023] CR2: 0000000000000008
[Mon Jun 12 10:50:09 UTC 2023] ---[ end trace 353e5ae9ef11cd10 ]---
[Mon Jun 12 10:50:09 UTC 2023] RIP: 0010:__handle_changed_spte+0x5f3/0x670
[Mon Jun 12 10:50:09 UTC 2023] Code: b8 a8 00 00 00 e9 4d be 0f 00 4d 8d be 60 6a 01 00 4c 89 44 24 08 4c 89 ff e8 69 30 43 01 4c 8b 44 24 08 49 8b 40 08 49 8b 10 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 00 48 83
[Mon Jun 12 10:50:09 UTC 2023] RSP: 0018:ffffc90029477840 EFLAGS: 00010246
[Mon Jun 12 10:50:09 UTC 2023] RAX: 0000000000000000 RBX: ffff89581a1f6000 RCX: 0000000000000000
[Mon Jun 12 10:50:09 UTC 2023] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffc9002d858a60
[Mon Jun 12 10:50:09 UTC 2023] RBP: 0000000000002200 R08: ffff893450426450 R09: 0000000000000002
[Mon Jun 12 10:50:09 UTC 2023] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[Mon Jun 12 10:50:09 UTC 2023] R13: 00000000000005a0 R14: ffffc9002d842000 R15: ffffc9002d858a60
[Mon Jun 12 10:50:09 UTC 2023] FS: 00007fdb6c1ff6c0(0000) GS:ffff89804d800000(0000) knlGS:0000000000000000
[Mon Jun 12 10:50:09 UTC 2023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun 12 10:50:09 UTC 2023] CR2: 0000000000000008 CR3: 0000005380534005 CR4: 0000000000770ee0
[Mon Jun 12 10:50:09 UTC 2023] PKRU: 55555554

We've seen this issue with kernel 5.15.115, 5.15.79, some versions between the
two, and probably a 5.15.4x (not sure here). At the beginning, only a few
"identical" hosts (same hardware model) had this issue, but since then we've
also had crashes on hosts running different hardware. Unfortunately, it
sometimes takes a few weeks to trigger (the last occurrence before this one
was 2 months ago) and we can't think of a way to reproduce it on demand.
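In case it helps: as far as we can tell, the Code: bytes around the faulting
instruction decode to the body of a list_del(). The sequence loads entry->prev
and entry->next, then faults on "next->prev = prev" (RDX = 0, hence
CR2 = 0x8 = offsetof(struct list_head, prev)), and the 0xdead000000000100
immediate right after the fault site is LIST_POISON1. That would mean
list_del() ran on a list_head whose next and prev are both NULL, i.e. one that
was zero-initialized but never list_add()ed, or that got corrupted (perhaps
the sp->link of a TDP MMU page, though that part is our guess). A minimal
userspace sketch of that reading, under those assumptions (list_del_like is
our stand-in for the kernel helper, not kernel code as-is):

/*
 * Sketch: why list_del() on a zeroed list_head faults at address 0x8,
 * matching CR2: 0000000000000008 and RDX = 0 in the oops above.
 */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)0xdead000000000100) /* visible in the Code: bytes */
#define LIST_POISON2 ((struct list_head *)0xdead000000000122)

static void list_del_like(struct list_head *entry)
{
	entry->next->prev = entry->prev; /* <48> 89 42 08: faults when next == NULL */
	entry->prev->next = entry->next; /* 48 89 10 */
	entry->next = LIST_POISON1;      /* movabs $0xdead000000000100,%rax */
	entry->prev = LIST_POISON2;
}

int main(void)
{
	struct list_head never_added = { NULL, NULL }; /* zeroed, never list_add()ed */
	printf("prev lives at offset %zu\n", offsetof(struct list_head, prev)); /* 8 */
	list_del_like(&never_added); /* SIGSEGV with fault address 0x8, like the oops */
	return 0;
}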
As you can see in the attached dmesg logs, the initial oops is then followed
by soft lockups in other tasks, presumably because they are waiting on a lock
that the crashed task never released. The host then becomes more and more
unresponsive as time goes by.

Let me know if I can provide any other details.
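One more data point that may help narrow things down: ORIG_RAX = 0x10
(__NR_ioctl) and RSI = 0x4020ae46 in the userspace frame correspond to
ioctl(KVM_SET_USER_MEMORY_REGION), and per the KVM API a memslot is deleted
by passing memory_size = 0. A stripped-down sketch of what QEMU is
effectively doing when the oops fires (the slot number and fd handling here
are illustrative, not QEMU's actual code):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* vm_fd is the VM file descriptor returned by ioctl(kvm_fd, KVM_CREATE_VM, 0). */
static int delete_memslot(int vm_fd, unsigned int slot)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = 0,
		.guest_phys_addr = 0, /* ignored when deleting */
		.memory_size     = 0, /* size 0 => delete the slot */
		.userspace_addr  = 0,
	};

	/*
	 * This ioctl is the one that reaches kvm_vm_ioctl() ->
	 * __kvm_set_memory_region() -> kvm_delete_memslot() in the trace
	 * above; the oops fires while the invalidated roots are being zapped.
	 */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}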