On Tue, Dec 17, 2019 at 04:05:25PM -0500, Josef Bacik wrote:
On 12/12/19 11:09 PM, Naohiro Aota wrote:
We have two type of I/Os during the device-replace process. One is a I/O to
"copy" (by the scrub functions) all the device extents on the source device
to the destination device. The other one is a I/O to "clone" (by
handle_ops_on_dev_replace()) new incoming write I/Os from users to the
source device into the target device.
Cloning incoming I/Os can break the sequential write rule in the target
device. When write is mapped in the middle of a block group, that I/O is
directed in the middle of a zone of target device, which breaks the
sequential write rule.
However, the cloning function cannot be simply disabled since incoming I/Os
targeting already copied device extents must be cloned so that the I/O is
executed on the target device.
We cannot use dev_replace->cursor_{left,right} to determine whether bio is
going to not yet copied region. Since we have time gap between finishing
btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have newly allocated device extent
which is never cloned nor copied.
So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block group as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.
Device-replace process in HMZONED mode must copy or clone all the extents
in the source device exctly once. So, we need to use to ensure allocations
started just before the dev-replace process to have their corresponding
extent information in the B-trees. finish_extent_writes_for_hmzoned()
implements that functionality, which basically is the removed code in the
commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after
error during device replace").
This patch also handles empty region between used extents. Since
dev-replace is smart to copy only used extents on source device, we have to
fill the gap to honor the sequential write rule in the target device.
Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>
Can you split up the copying part and the cloning part into different
patches, this is a bear to review. Also I don't quite understand the
zeroout behavior. It _looks_ like for cloning you are doing a zeroout
for the gap between the last wp position and the current cloned bio,
which makes sense, but doesn't this gap exist because copying is
ongoing? Can you copy into a zero'ed out position? Or am I missing
something here? Thanks,
Josef
OK, I will split this in the next version. (but, it's mostly "copying"
part)
Let me clarify first that I am using "copying" for copying existing
extents to the new device and "cloning" for cloning a new incoming BIO
to the new device.
For zeroout, it is for "copying" which is done with the scrub code to
copy existing extents on the source devie to the destination
device. Since copying or scrub only scans for living extents, there
can be a gap between two living extents. So, we need to fill a gap
with zeroout to make the writing stream sequential.
And "cloning" is only done for new block groups or already fully
copied block groups. So there is no gaps for them because the
allocator and the IO locks ensures the sequential allocation and
submit.
Thanks,