Re: [PATCH 1/2] migration: Implement dirty ring

On Tue, Jun 25, 2024 at 08:10:23PM +0900, Shota Imamura wrote:
> Dear Peter Xu,
> 
> Thank you for your feedback.
> 
> > It looks like this patch will introduce a ring but still it keeps the
> > bitmaps around.
> >
> > Could you elaborate your motivation of this work? It’ll be interesting to
> > know whether you did any kind of measurement around it.
> 
> First of all, I apologize for the lack of explanation.
> To provide more details, the motivation for this work stems from the
> current QEMU implementation, where pages obtained from the KVM ring are set
> into the KVMSlot/RAMList/RAMBlock bitmaps. Consequently, when the migration
> thread sends data, it ends up scanning the bitmap (resulting in O(N) time
> complexity). I aimed to improve this situation.

So is this a research project?  Or do you have explicit goals, e.g. on
reducing migration time or making migration easier to converge?

This information might be helpful for reviewers to understand the ultimate
goal of this series.

> 
> Here are the steps and considerations in my implementation plan:
> 
> 1. Call MigrationOps::ram_save_target_page inside kvm_dirty_ring_reap_one.
> 
> The approach involves QEMU holding neither bitmaps nor rings and sending
> immediately. However, this would require non-migration threads (like accel
> threads) to send pages, necessitating a synchronization mechanism with the
> migration thread, which I deemed would increase code complexity.
> Additionally, if future non-KVM accels provided their rings, we would have
> to write similar code in different places, increasing future maintenance
> costs. For these reasons, I decided against an implementation where page
> sending occurs within accel/kvm and opted for a separate ring within QEMU.

Yes, this won't trivially work cleanly, as ram_save_target_page() requires
the migration context.  It may also not be thread-safe, e.g. QEMUFile
isn't thread-safe.

"maybe" it'll be easier to do it the other way round: allowing migration to
fetch entry from a ring by calling a registered ring op.
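
To illustrate the idea (just a rough sketch, all names are made up and
none of this is an existing QEMU interface): the accel side registers a
couple of callbacks, and only the migration thread ever calls them, so
QEMUFile stays single-threaded:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical "ring op" interface an accelerator would register. */
  typedef struct DirtyRingOps {
      /* Pop one dirty guest frame number; false if the ring is empty. */
      bool (*pop)(void *opaque, uint64_t *gfn);
      bool (*empty)(void *opaque);
      void *opaque;
  } DirtyRingOps;

  static DirtyRingOps *dirty_ring_ops;

  void migration_register_dirty_ring_ops(DirtyRingOps *ops)
  {
      dirty_ring_ops = ops;
  }

  /* Called from the migration thread only. */
  static bool migration_pop_dirty_gfn(uint64_t *gfn)
  {
      return dirty_ring_ops &&
             dirty_ring_ops->pop(dirty_ring_ops->opaque, gfn);
  }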

> 
> 2. Use a ring as an alternative to bitmaps.
> 
> The approach involves implementing a ring within QEMU and inserting pages
> into the ring instead of setting them into bitmaps in functions like
> kvm_dirty_ring_mark_page, cpu_physical_memory_set_dirty_lebitmap, and
> cpu_physical_memory_set_dirty_range. Then, pages are retrieved from the
> ring in ram_find_and_save_block.

(I think this might be what I mentioned above)
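
For the produce side of (2), a QEMU-side ring could be very dumb.  Here's
a rough sketch with made-up names, just to show the shape of it; the
places that currently set bitmap bits would call the push helper instead:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical QEMU-side dirty ring; this sketch ignores the locking
   * or atomics that real multi-producer use would need. */
  typedef struct MigDirtyRing {
      uint64_t *slots;
      uint32_t size;                /* power of two */
      uint32_t head, tail;
  } MigDirtyRing;

  static inline bool mig_ring_full(const MigDirtyRing *r)
  {
      return r->head - r->tail == r->size;
  }

  static inline bool mig_ring_push(MigDirtyRing *r, uint64_t gfn)
  {
      if (mig_ring_full(r)) {
          return false;             /* caller falls back to the bitmap */
      }
      r->slots[r->head++ & (r->size - 1)] = gfn;
      return true;
  }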

> However, this approach necessitates immediate sending of pages when the
> ring is full, which might involve non-migration threads sending pages,
> leading to the same issues as mentioned in step 1. Given the ring has a
> limited capacity, if there are enough dirty pages to fill the ring, the
> cost difference between operating the ring and scanning the entire bitmap
> would be negligible.  Hence, I decided to fall back to bitmaps when the
> ring is full.

Right, that was also what I found - the ring may not work as well when the
guest is very busy.  The catch is that migration is normally more
challenging exactly in that case, while a fairly idle guest is easy to
migrate anyway, even with bitmap-only tracking.  I noticed that pretty
late myself.

> 
> 3. Use bitmaps when the ring is full.
> 
> The approach involves setting bitmaps while simultaneously inserting pages
> into the ring in functions like kvm_dirty_ring_mark_page,
> cpu_physical_memory_set_dirty_lebitmap, and
> cpu_physical_memory_set_dirty_range. If the ring is not full, it is used in
> ram_find_and_save_block; otherwise, bitmaps are used. This way, only the
> migration thread sends pages. Additionally, by checking if a page is
> already in the ring (O(1) complexity), redundant entries are avoided.
> However, enqueuing and dequeuing are handled by different threads, which
> could result in a situation where pages exist in the bitmap but not in the
> ring once it is full. Identifying these pages would require locking and
> scanning the entire bitmap.

Yes, is this series using this approach?

I think the ring's effect may be minimal if the bitmap is kept around
anyway: as long as the ring can fall back (to avoid hanging a vCPU for a
long time), the bitmap can contain useful data and the bitmap scans will
still be needed, which largely invalidates the ring's benefit, IMHO.

What I was thinking a long time ago was something like this: we only use
the ring, as you said in (2) above, so no bitmap at all.  Then the
migration core can look like this:

  migration_thread():
      count = 0
      while (true) {
          count++
          if ((count % ENFORCE_BACKGROUND_COUNT == 0) || ring_empty())
              /* force progress on the initial full pass now and then */
              send_page(background.pop())
          else
              /* otherwise prioritize freshly dirtied pages */
              send_page(ring.pop())
      }

To explain it a bit:

  - we will have two page iterators:

    - background.pop() means we pop one background page.  Since we don't
      have a bitmap, we can't scan anything; the background iterator just
      walks guest memory from page 0 to page N-1 (assuming N guest pages),
      once and for all.  It means we pop guest page 0, then 1, ... until
      N-1, and then it's all done.  After this initial pass nothing is
      left in the background iterator; all dirty pages reside only in the
      ring.

    - ring.pop() means we fetch one entry from the ring.

  - ENFORCE_BACKGROUND_COUNT guarantees the background work always makes
    some progress over time, so that at some point it drains empty and
    only ring entries are left.  Otherwise, if we did something like "if
    (!ring_empty()) send_page(ring.pop())", we could keep sending ring
    pages without ever finishing the initial round of RAM, so migration
    would never complete.

Then the background migration doesn't need to keep iterating - it only
happens once, and after that we know exactly when we've finished.
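
To make that concrete, the background iterator can be nothing more than a
cursor over guest pages, and detecting the end of the initial pass becomes
trivial (again a rough sketch, names invented):

  #include <stdbool.h>
  #include <stdint.h>

  /* Walks guest pages 0..N-1 exactly once. */
  static uint64_t bg_next;          /* next page to send */
  static uint64_t bg_total;         /* total number of guest pages, N */

  /* Returns the next background page, or -1 once the initial pass is
   * done and only ring entries remain. */
  static int64_t background_pop(void)
  {
      return bg_next < bg_total ? (int64_t)bg_next++ : -1;
  }

  static bool background_done(void)
  {
      return bg_next >= bg_total;
  }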

That will be a totally different way to describe migration.  I don't know
whether it'll make sense at all.  There are a lot of hard questions to
answer, e.g. besides MIGRATION we also have the TCG/VGA bitmap consumers,
and we need to know how to handle them.  It's also just challenging to
remove all the bitmap layers because there are so many of them in QEMU.

I'm not sure whether the above helps at all, but let me put it here
anyway; maybe it'll give you something useful to think about.

> 
> 4. Use two rings to revert to the ring after falling back to the bitmap
> within a round.
> 
> As mentioned earlier, once the fallback to the bitmap occurs, pages that
> get dirty after the ring is full cannot be captured by the ring. This would
> necessitate using bitmaps until the final round or periodically locking and
> scanning the entire bitmap to synchronize with the ring. To improve this, I
> considered using two rings: one for enqueueing and one for dequeuing. Pages
> are inserted into the enqueue ring in functions like
> kvm_dirty_ring_mark_page and cpu_physical_memory_set_dirty_range, and the
> rings are swapped in migration_sync_bitmap, with pages being retrieved in
> ram_find_and_save_block. This way, each call to migration_sync_bitmap (or
> ram_state_pending_exact) determines whether to use the ring or the bitmap
> in subsequent rounds.
> 
> Based on this reasoning, I implemented a system that combines bitmaps and
> rings.
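
On (4), for what it's worth, the enqueue/dequeue swap could be as simple
as exchanging two ring pointers under a lock at sync time.  A rough
sketch, reusing the made-up MigDirtyRing from above and assuming the
usual qemu/thread.h mutex helpers:

  /* vCPU/accel threads fill 'enq'; the migration thread drains 'deq';
   * the sync path swaps them.  All names are invented. */
  static MigDirtyRing ring_a, ring_b;
  static MigDirtyRing *enq = &ring_a, *deq = &ring_b;
  static QemuMutex ring_lock;

  static void migration_rings_swap(void)
  {
      MigDirtyRing *tmp;

      qemu_mutex_lock(&ring_lock);
      tmp = enq;
      enq = deq;
      deq = tmp;
      qemu_mutex_unlock(&ring_lock);
  }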
> 
> Regarding performance, my local environment might be insufficient for
> proper measurement, but I obtained the following results by migrating after
> booting the latest Linux and Buildroot with an open login shell:
> 
> Commands used:
> ```
> # src
> sudo ./qemu-system-x86_64 \
> -accel kvm,dirty-ring-size=1024 \
> -m 8G \
> -boot c \
> -kernel ~/path/to/linux/arch/x86/boot/bzImage \
> -hda ~/path/to/buildroot/output/images/rootfs.ext4 \
> -append "root=/dev/sda rw console=ttyS0,115200 acpi=off" \
> -nographic \
> -migration dirty-logging=ring,dirty-ring-size=1024
> 
> # dst
> sudo ./qemu-system-x86_64 \
> -accel kvm,dirty-ring-size=1024 \
> -m 8G \
> -boot c \
> -kernel ~/path/to/linux/arch/x86/boot/bzImage \
> -hda ~/path/to/buildroot/output/images/rootfs.ext4 \
> -append "root=/dev/sda rw console=ttyS0,115200 acpi=off" \
> -nographic \
> -incoming tcp:0:4444
> 
> # hmp
> migrate_set_parameter max-bandwidth 1250
> migrate tcp:0:4444
> info migrate
> ```
> 
> Results for each memory size, measured 5 times:
> ```
> # ring -m 8G
> total time: 418 ms
> total time: 416 ms
> total time: 415 ms
> total time: 416 ms
> total time: 416 ms
> 
> # bitmap -m 8G
> total time: 434 ms
> total time: 421 ms
> total time: 423 ms
> total time: 430 ms
> total time: 429 ms
> 
> # ring -m 16G
> total time: 847 ms
> total time: 852 ms
> total time: 850 ms
> total time: 848 ms
> total time: 852 ms
> 
> # bitmap -m 16G
> total time: 860 ms
> total time: 862 ms
> total time: 858 ms
> total time: 859 ms
> total time: 861 ms
> 
> # ring -m 32G
> total time: 1616 ms
> total time: 1625 ms
> total time: 1612 ms
> total time: 1612 ms
> total time: 1630 ms
> 
> # bitmap -m 32G
> total time: 1714 ms
> total time: 1724 ms
> total time: 1718 ms
> total time: 1714 ms
> total time: 1714 ms
> 
> # ring -m 64G
> total time: 3451 ms
> total time: 3452 ms
> total time: 3449 ms
> total time: 3451 ms
> total time: 3450 ms
> 
> # bitmap -m 64G
> total time: 3550 ms
> total time: 3553 ms
> total time: 3552 ms
> total time: 3550 ms
> total time: 3553 ms
> 
> # ring -m 96G
> total time: 5185 ms
> total time: 5186 ms
> total time: 5183 ms
> total time: 5191 ms
> total time: 5191 ms
> 
> # bitmap -m 96G
> total time: 5385 ms
> total time: 5388 ms
> total time: 5386 ms
> total time: 5392 ms
> total time: 5592 ms
> ```
> 
> It is natural that the implemented ring completes migration faster for all
> memory sizes, given that the conditions favor the ring due to the minimal
> memory workload. By the way, these are total migration times, with much of
> the overhead attributed to page sending and other IO operations.

It's interesting to know there's already a perf difference.  I wonder how
that happened, though, given that the bitmap is still there and it looks
like the scan will still be needed anyway.

I didn't check, but a ring-full event in KVM may imply halting the vCPU
threads; if so, a vCPU that is dirtying memory would dirty it more
slowly, which would mean fewer total dirty pages to move.  Maybe the
total migration time is reduced because of that.  But I'm not sure.

> 
> I plan to conduct more detailed measurements going forward, but if you have
> any recommendations for good measurement methods, please let me know.
> 
> > I remember adding such option is not suggested. We may consider using
> > either QMP to setup a migration parameter, or something else.
> 
> I apologize for implementing this without knowledge of QEMU's policy.
> I will remove this option and instead implement it using
> migrate_set_parameter or migrate_set_capability. Is this approach
> acceptable?
> 
> This is my first time contributing to QEMU, so I appreciate your guidance.

Not sure how you'll nail this, but IMHO this can be a very research-level
project.  I'm not sure what your goal is; if it's about contributing to
QEMU, there may be more solid problems to solve - let me know if that is
the case.  At least I am not crystal clear on how to best leverage the
rings yet, even though it feels like a pretty neat design.

Thanks,

-- 
Peter Xu




