> On Jan 26, 2022, at 8:36 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Jan 27, 2022, Chris Mason wrote:
>>
>>
>>> On Jan 26, 2022, at 6:11 PM, Boris Burkov <boris@xxxxxx> wrote:
>>>
>>> On Wed, Jan 26, 2022 at 09:59:02PM +0000, Sean Christopherson wrote:
>>>> On Wed, Jan 26, 2022, Boris Burkov wrote:
>>>>> I tested this fix on the workload and it did prevent the hangs. However,
>>>>> I am unsure if the fix is appropriate from a locking perspective, so I
>>>>> hope to draw some extra attention to that aspect. set_page_dirty_lock in
>>>>> mm/page-writeback.c has a comment about locking that says set_page_dirty
>>>>> should be called with the page locked or while definitely holding a
>>>>> reference to the mapping's host inode. I believe that the mmap should
>>>>> have that reference, so for fear of hurting KVM performance or
>>>>> introducing a deadlock, I opted for the unlocked variant.
>>>>
>>>> KVM doesn't hold a reference per se, but it does subscribe to mmu_notifier events
>>>> and will not mark the page dirty after KVM has been instructed to unmap the page
>>>> (barring bugs, which we've had a slew of). So yeah, the unlocked variant should
>>>> be safe.
>>>>
>>>> Is it feasible to trigger this behavior in a selftest? KVM has had, and probably
>>>> still has, many bugs that all boil down to KVM assuming guest memory is backed by
>>>> either anonymous memory or something like shmem/HugeTLBFS/memfd that isn't typically
>>>> truncated by the host.
>>>
>>> I haven't been able to isolate a reproducer, yet. I am a bit stumped
>>> because there isn't a lot for me to go off from that stack I shared--the
>>> best I have so far is that I need to trick KVM into emulating
>>> instructions at some point to get to this 'complete_userspace_io'
>>> codepath? I will keep trying, since I think it would be valuable to know
>>> what exactly happened. Open to try any suggestions you might have as
>>> well.
>>
>> From the btrfs side, bare calls to set_page_dirty() are suboptimal, since it
>> doesn’t go through the ->page_mkwrite() dance that we use to properly COW
>> things. It’s still much better than SetPageDirty(), but I’d love to
>> understand why kvm needs to dirty the page so we can figure out how to go
>> through the normal mmap file io paths.
>
> Ah, is the issue that writeback gets stuck because KVM perpetually marks the
> page as dirty? The page in question should have already gone through ->page_mkwrite().
> Outside of one or two internal mmaps that KVM fully controls and are anonymous memory,
> KVM doesn't modify VMAs. KVM is calling SetPageDirty() to mark that it has written
> to the page; KVM does so either when it unmaps the page from the guest, or in this
> case, when it kunmap()'s a page KVM itself accessed.
>

I think KVM is just calling SetPageDirty() once. The problem is that
SetPageDirty() just flips the bit and doesn’t set any of the tags in the
radix tree, so we can easily hit this check in filemap_fdatawrite_wbc():

        if (!mapping_can_writeback(mapping) ||
            !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                return 0;

Since almost everyone writing dirty pages to disk wanders through a check
or search for tagged pages, the page just never gets written at all.

> Based on the call stack, my best guess is that KVM is updating steal_time info.
> That's triggered when the vCPU is (re)loaded, which would explain the correlation
> to complete_userspace_io() as KVM unloads=>reloads the vCPU before/after exiting
> to userspace to handle emulated I/O.
>
> Oh! I assume that the page is either unmapped or made read-only before writeback?
> v5.6 (and many kernels since) had a bug where KVM would "miss" mmu_notifier events
> for the steal_time cache. It's basically a use-after-free issue at that point. Commit
> 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")

Oh, looks like we are missing that one, interesting.
We use clear_page_dirty_for_io() before writing pages, so yes, it does get
set read-only via page_mkclean().

-chris
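
To make the SetPageDirty() point above concrete, here is a rough userspace sketch of the failure mode. It is a toy model, not kernel code: every name in it (toy_mapping, toy_set_dirty_bit_only, toy_set_dirty_and_tag, toy_writeback) is made up for illustration, and it assumes that the set_page_dirty() path tags the mapping as dirty while SetPageDirty() only flips the per-page bit. The writeback routine mirrors the shape of the early return in filemap_fdatawrite_wbc(): with no dirty tag, it never looks at the pages at all.

```c
/* Toy userspace model, NOT kernel code. All names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

struct toy_mapping {
	bool page_dirty[TOY_NR_PAGES]; /* stands in for the per-page dirty bit */
	bool tag_dirty[TOY_NR_PAGES];  /* stands in for PAGECACHE_TAG_DIRTY */
};

/* Models SetPageDirty(): flips the page bit, leaves the tags alone. */
static void toy_set_dirty_bit_only(struct toy_mapping *m, size_t idx)
{
	assert(idx < TOY_NR_PAGES);
	m->page_dirty[idx] = true;
}

/* Models set_page_dirty() (assumed): flips the bit and sets the tag. */
static void toy_set_dirty_and_tag(struct toy_mapping *m, size_t idx)
{
	assert(idx < TOY_NR_PAGES);
	m->page_dirty[idx] = true;
	m->tag_dirty[idx] = true;
}

/* Models mapping_tagged(mapping, PAGECACHE_TAG_DIRTY). */
static bool toy_mapping_tagged_dirty(const struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		if (m->tag_dirty[i])
			return true;
	return false;
}

/*
 * Models the writeback entry point: bail out early when nothing is
 * tagged, otherwise write only the tagged pages. Returns the number
 * of pages "written".
 */
static int toy_writeback(struct toy_mapping *m)
{
	int written = 0;

	if (!toy_mapping_tagged_dirty(m))
		return 0; /* a page with only the bit set is never seen */

	for (size_t i = 0; i < TOY_NR_PAGES; i++) {
		if (!m->tag_dirty[i])
			continue;
		m->tag_dirty[i] = false;
		if (m->page_dirty[i]) {
			m->page_dirty[i] = false;
			written++;
		}
	}
	return written;
}
```

In this model, a page dirtied with toy_set_dirty_bit_only() stays dirty across any number of toy_writeback() calls, and only gets written once something sets the tag as well.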