Dear Peter Xu, Thank you for your feedback. > It looks like this patch will introduce a ring but still it keeps the > bitmaps around. > > Could you elaborate your motivation of this work? It’ll be interesting to > know whether you did any kind of measurement around it. First of all, I apologize for the lack of explanation. To provide more details, the motivation for this work stems from the current QEMU implementation, where pages obtained from the KVM ring are set into the KVMSlot/RAMList/RAMBlock bitmaps. Consequently, when the migration thread sends data, it ends up scanning the bitmap (resulting in O(N) time complexity). I aimed to improve this situation. Here are the steps and considerations in my implementation plan: 1. Call MigrationOps::ram_save_target_page inside kvm_dirty_ring_reap_one. The approach involves QEMU holding neither bitmaps nor rings and sending immediately. However, this would require non-migration threads (like accel threads) to send pages, necessitating a synchronization mechanism with the migration thread, which I deemed would increase code complexity. Additionally, if future non-KVM accels provided their rings, we would have to write similar code in different places, increasing future maintenance costs. For these reasons, I decided against an implementation where page sending occurs within accel/kvm and opted for a separate ring within QEMU. 2. Use a ring as an alternative to bitmaps. The approach involves implementing a ring within QEMU and inserting pages into the ring instead of setting them into bitmaps in functions like kvm_dirty_ring_mark_page, cpu_physical_memory_set_dirty_lebitmap, and cpu_physical_memory_set_dirty_range. Then, pages are retrieved from the ring in ram_find_and_save_block. However, this approach necessitates immediate sending of pages when the ring is full, which might involve non-migration threads sending pages, leading to the same issues as mentioned in step 1. Given the ring has a limited capacity, if there are enough dirty pages to fill the ring, the cost difference between operating the ring and scanning the entire bitmap would be negligible. Hence, I decided to fall back to bitmaps when the ring is full. 3. Use bitmaps when the ring is full. The approach involves setting bitmaps while simultaneously inserting pages into the ring in functions like kvm_dirty_ring_mark_page, cpu_physical_memory_set_dirty_lebitmap, and cpu_physical_memory_set_dirty_range. If the ring is not full, it is used in ram_find_and_save_block; otherwise, bitmaps are used. This way, only the migration thread sends pages. Additionally, by checking if a page is already in the ring (O(1) complexity), redundant entries are avoided. However, enqueuing and dequeuing are handled by different threads, which could result in a situation where pages exist in the bitmap but not in the ring once it is full. Identifying these pages would require locking and scanning the entire bitmap. 4. Use two rings to revert to the ring after falling back to the bitmap within a round. As mentioned earlier, once the fallback to the bitmap occurs, pages that get dirty after the ring is full cannot be captured by the ring. This would necessitate using bitmaps until the final round or periodically locking and scanning the entire bitmap to synchronize with the ring. To improve this, I considered using two rings: one for enqueueing and one for dequeuing. Pages are inserted into the enqueue ring in functions like kvm_dirty_ring_mark_page and cpu_physical_memory_set_dirty_range, and the rings are swapped in migration_sync_bitmap, with pages being retrieved in ram_find_and_save_block. This way, each call to migration_sync_bitmap (or ram_state_pending_exact) determines whether to use the ring or the bitmap in subsequent rounds. Based on this reasoning, I implemented a system that combines bitmaps and rings. Regarding performance, my local environment might be insufficient for proper measurement, but I obtained the following results by migrating after booting the latest Linux and Buildroot with an open login shell: Commands used: ``` # src sudo ./qemu-system-x86_64 \ -accel kvm,dirty-ring-size=1024 \ -m 8G \ -boot c \ -kernel ~/path/to/linux/arch/x86/boot/bzImage \ -hda ~/path/to/buildroot/output/images/rootfs.ext4 \ -append "root=/dev/sda rw console=ttyS0,115200 acpi=off" \ -nographic \ -migration dirty-logging=ring,dirty-ring-size=1024 # dst sudo ./qemu-system-x86_64 \ -accel kvm,dirty-ring-size=1024 \ -m 8G \ -boot c \ -kernel ~/path/to/linux/arch/x86/boot/bzImage \ -hda ~/path/to/buildroot/output/images/rootfs.ext4 \ -append "root=/dev/sda rw console=ttyS0,115200 acpi=off" \ -nographic \ -incoming tcp:0:4444 # hmp migrate_set_parameter max-bandwidth 1250 migrate tcp:0:4444 info migrate ``` Results for each memory size, measured 5 times: ``` # ring -m 8G total time: 418 ms total time: 416 ms total time: 415 ms total time: 416 ms total time: 416 ms # bitmap -m 8G total time: 434 ms total time: 421 ms total time: 423 ms total time: 430 ms total time: 429 ms # ring -m 16G total time: 847 ms total time: 852 ms total time: 850 ms total time: 848 ms total time: 852 ms # bitmap -m 16G total time: 860 ms total time: 862 ms total time: 858 ms total time: 859 ms total time: 861 ms # ring -m 32G total time: 1616 ms total time: 1625 ms total time: 1612 ms total time: 1612 ms total time: 1630 ms # bitmap -m 32G total time: 1714 ms total time: 1724 ms total time: 1718 ms total time: 1714 ms total time: 1714 ms # ring -m 64G total time: 3451 ms total time: 3452 ms total time: 3449 ms total time: 3451 ms total time: 3450 ms # bitmap -m 64G total time: 3550 ms total time: 3553 ms total time: 3552 ms total time: 3550 ms total time: 3553 ms # ring -m 96G total time: 5185 ms total time: 5186 ms total time: 5183 ms total time: 5191 ms total time: 5191 ms # bitmap -m 96G total time: 5385 ms total time: 5388 ms total time: 5386 ms total time: 5392 ms total time: 5592 ms ``` It is natural that the implemented ring completes migration faster for all memory sizes, given that the conditions favor the ring due to the minimal memory workload. By the way, these are total migration times, with much of the overhead attributed to page sending and other IO operations. I plan to conduct more detailed measurements going forward, but if you have any recommendations for good measurement methods, please let me know. > I remember adding such option is not suggested. We may consider using > either QMP to setup a migration parameter, or something else. I apologize for implementing this without knowledge of QEMU's policy. I will remove this option and instead implement it using migrate_set_parameter or migrate_set_capability. Is this approach acceptable? This is my first time contributing to QEMU, so I appreciate your guidance. Thank you and best regards, -- Shota Imamura 2024年6月25日(火) 4:08 Peter Xu <peterx@xxxxxxxxxx>: > > Hi, Shota, > > On Thu, Jun 20, 2024 at 06:47:13PM +0900, Shota Imamura wrote: > > This commit implements the dirty ring as an alternative dirty tracking > > method to the dirty bitmap. > > > > While the dirty ring has already been implemented in accel/kvm using KVM's > > dirty ring, it was designed to set bits in the ramlist and ramblock bitmap. > > This commit introduces a new dirty ring to replace the bitmap, allowing the > > use of the dirty ring even without KVM. When using KVM's dirty ring, this > > implementation maximizes its effectiveness. > > It looks like this patch will introduce a ring but still it keeps the > bitmaps around. > > Could you elaborate your motivation of this work? It'll be interesting to > know whether you did any kind of measurement around it. > > > > > To enable the dirty ring, specify the startup option > > "-migration dirty-logging=ring,dirty-ring-size=N". To use the bitmap, > > either specify nothing or "-migration dirty-logging=bitmap". If the dirty > > ring becomes full, it falls back to the bitmap for that round. > > I remember adding such option is not suggested. We may consider using > either QMP to setup a migration parameter, or something else. > > Thanks, > > -- > Peter Xu >