On Wed, Feb 05, 2020 at 10:22:34AM +0100, David Hildenbrand wrote:
> >> 1. Guest allocates a page and sends it to the host.
> >> 2. Shrinker gets active and releases that page again.
> >> 3. Some user in the guest allocates and modifies that page. The dirty bit is
> >> set in the hypervisor.
> >
> > The bit will be set in KVM's bitmap, and will be synced to QEMU's bitmap when the next round starts.
> >
> >> 4. The host processes the request and clears the bit in the dirty bitmap.
> >
> > This clears the bit from the QEMU bitmap, and this page will not be sent in this round.
> >
> >> 5. The guest is stopped and the last set of dirty pages is migrated. The
> >> modified page is not being migrated (because not marked dirty).
> >
> > When QEMU start the last round, it first syncs the bitmap from KVM, which includes the one set in step 3.
> > Then the modified page gets sent.
>
> So, if you run a TCG guest and use it with free page reporting, the race
> is possible?

I'd have to look at the implementation but the basic idea is not kvm
specific. The idea is that hypervisor can detect that 3 happened after 1,
by means of creating a copy of the dirty bitmap when request is sent to
the guest.

> So the correctness depends on two dirty bitmaps in the
> hypervisor and how they interact. wow this is fragile.
>
> Thanks for the info :)

It's pretty fragile, and the annoying part is we do not actually benefit
from this at all since it all only triggers in the shrinker corner case.

The original idea was that we can send any hint to hypervisor and reuse
the page immediately without waiting for hint to be seen. That seemed
worth having, as a means to minimize impact of hinting.

Then we dropped that and switched to OOM, and there not having to wait
also seemed like a worthwhile thing.

In the end we switched to shrinker where we can wait if we like, and many
guests never even hit the shrinker so we have sacrificed robustness for
nothing.

If we go back to OOM then at least it's justified ..

> --
> Thanks,
>
> David / dhildenb
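
For readers following the thread, the two-bitmap interaction described in
the quoted steps can be replayed with a toy model. This is only a sketch
under stated assumptions: the class and method names are hypothetical and
are not QEMU or KVM APIs, sets stand in for bitmaps, and the only behaviour
modelled is what the thread states (guest writes land in the KVM-side
bitmap, a sync at the start of each round folds them into the QEMU-side
bitmap, and a free page hint clears the QEMU-side bit only).

```python
# Toy model only. Names are hypothetical, not QEMU or KVM APIs.


class DirtyTracking:
    def __init__(self):
        self.kvm_dirty = set()   # pages written since the last sync
        self.qemu_dirty = set()  # pages still to be sent by migration

    def guest_write(self, page):
        # A guest write is logged in the KVM-side bitmap only.
        self.kvm_dirty.add(page)

    def sync(self):
        # Start of a round: fold the KVM-side log into the QEMU-side bitmap.
        self.qemu_dirty |= self.kvm_dirty
        self.kvm_dirty.clear()

    def process_free_hint(self, page):
        # A free page hint clears the bit in the QEMU-side bitmap only.
        self.qemu_dirty.discard(page)

    def send_round(self):
        # Send every page currently marked dirty, then clear the bitmap.
        sent = sorted(self.qemu_dirty)
        self.qemu_dirty.clear()
        return sent


def replay_race(page=42):
    t = DirtyTracking()
    t.guest_write(page)          # page was dirtied some time earlier
    t.sync()                     # a round starts; page is pending in QEMU
    # Step 1: guest hints the page as free (request still in flight).
    # Step 2: shrinker releases the page back to the guest.
    t.guest_write(page)          # step 3: page is reused and modified
    t.process_free_hint(page)    # step 4: host processes the stale hint
    this_round = t.send_round()  # page is skipped in this round
    t.sync()                     # step 5: last round syncs from KVM first
    last_round = t.send_round()  # modified page is sent after all
    return this_round, last_round


if __name__ == "__main__":
    print(replay_race())
```

In this model the page is resent only because the write in step 3 is
logged in a bitmap that the hint does not touch. If the write and the hint
both acted on the same bitmap, step 4 would erase the record of step 3,
which is the race the numbered steps describe.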