[Bug 201825] New: Guest freezes during high IO load, host partially freezes, stack points to try_async_pf

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



https://bugzilla.kernel.org/show_bug.cgi?id=201825

            Bug ID: 201825
           Summary: Guest freezes during high IO load, host partially
                    freezes, stack points to try_async_pf
           Product: Virtualization
           Version: unspecified
    Kernel Version: 4.19.2
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: kvm
          Assignee: virtualization_kvm@xxxxxxxxxxxxxxxxxxxx
          Reporter: llowrey@xxxxxxxxxxxxxxxxx
        Regression: No

Created attachment 279759
  --> https://bugzilla.kernel.org/attachment.cgi?id=279759&action=edit
SysRq l

I've been getting regular freezes since kernel 4.17.19 and in all cases, the
stack traces include calls to try_async_pf. The problem started after I
"upgraded" to kernel 4.17.19. I reverted back to 4.17.12 and the problem did
not occur again. I have suffered these freezes with every kernel after 4.17.19.
I did not test kernels between 4.17.12 and 4.17.19 so I can't identify which
version introduced the problem.

The host is running 6 guests (either RHEL7 or Fedora 29). During periods of
high disk io (eg nightly backups), one or more of the guests will freeze and
the host will partially freeze. There are no error on the console.

What I mean by the host partially freezing is that existing processes seem to
operate normally (backups complete, web server continues to serve pages, etc)
but most new commands hang and ctrl-c does not kill them. I am unable to ssh in
completely. I can connect and authenticate but the connection hangs before I
get a shell prompt. Same happens when trying to log in via the console. Certain
commands like cat and echo work when used with files in /proc. But just about
anything else will hang.

I've been able to get sysrq stack dumps in many cases and I'll attach some to
this ticket. The traces all point to the guest performing an async page fault.
The host will typically have > 30GB of RAM free but also > 10GB of swap in use.
The swap device was originally on a raid 5 of SSDs but I moved it to a simple
partition on an NVMe device but the problem continued.

REPRODUCTION
1. Run several KVM guests VMs
2. Allow the guests to swap
3. Apply heavy IO load to storage
4. Excercise a VM to encourage a page fault

-- 
You are receiving this mail because:
You are watching the assignee of the bug.



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux