On 12/12/19 11:09 PM, Naohiro Aota wrote:
We have two type of I/Os during the device-replace process. One is a I/O to "copy" (by the scrub functions) all the device extents on the source device to the destination device. The other one is a I/O to "clone" (by handle_ops_on_dev_replace()) new incoming write I/Os from users to the source device into the target device. Cloning incoming I/Os can break the sequential write rule in the target device. When write is mapped in the middle of a block group, that I/O is directed in the middle of a zone of target device, which breaks the sequential write rule. However, the cloning function cannot be simply disabled since incoming I/Os targeting already copied device extents must be cloned so that the I/O is executed on the target device. We cannot use dev_replace->cursor_{left,right} to determine whether bio is going to not yet copied region. Since we have time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), we can have newly allocated device extent which is never cloned nor copied. So the point is to copy only already existing device extents. This patch introduces mark_block_group_to_copy() to mark existing block group as a target of copying. Then, handle_ops_on_dev_replace() and dev-replace can check the flag to do their job. Device-replace process in HMZONED mode must copy or clone all the extents in the source device exctly once. So, we need to use to ensure allocations started just before the dev-replace process to have their corresponding extent information in the B-trees. finish_extent_writes_for_hmzoned() implements that functionality, which basically is the removed code in the commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after error during device replace"). This patch also handles empty region between used extents. Since dev-replace is smart to copy only used extents on source device, we have to fill the gap to honor the sequential write rule in the target device. Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>
Can you split up the copying part and the cloning part into different patches, this is a bear to review. Also I don't quite understand the zeroout behavior. It _looks_ like for cloning you are doing a zeroout for the gap between the last wp position and the current cloned bio, which makes sense, but doesn't this gap exist because copying is ongoing? Can you copy into a zero'ed out position? Or am I missing something here? Thanks,
Josef