On 6/26/23 8:35 PM, Song Liu wrote:
On Fri, Jun 23, 2023 at 11:05 AM Jason Baron <jbaron@xxxxxxxxxx> wrote:
We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.
More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
layout in such a way the layout matches what the 'alternate' layout
would have produced. Thus, not all raid0 devices with the 'original'
layout, different size disks and discard enabled will encounter this
corruption.
The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This lead to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post 3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back to 5.4 discard support
for the 'original' layout was not added leading this issue.
I've been able to reliably reproduce the corruption with the following
test case:
1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)
Let's fix this by adding proper discard support to the 'original' layout.
The fix 'maps' the 'original' layout disks to the order in which they are
read/written such that we can compare the disks in the same way that the
current 'alternate' layout does. A 'disk_shift' field is added to
'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard() but by adding this field, we save some computation
in the discard path.
Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always read/writing them using
the 'alternate' layout. However, that is seen as a more substantial change,
and we are attempting the least invasive fix at this time to remedy the
corruption.
I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.
Cc: NeilBrown <neilb@xxxxxxx>
Cc: Song Liu <song@xxxxxxxxxx>
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx>
Looks good to me! Applied to md-next.
Since this will be released with 6.6, we should have a smaller and safer fix
before that. Would you mind create a patch that ignores all discards to
orig_layout and not the first zone? We will roll that to 6.5 and back port to
stable. Then this version will be shipped to 6.6+.
Hi Song,
Ok, I mean the current patch was meant to be fairly conservative in that
it attempts to only change codepaths where we are doing discards above
the first zone. IE Changing only the codepaths that currently don't work.
But if we want to be more conservative (given this fixes corruption), I
can post a patch to disable discard as you've suggested. I'm going to
let the testing run for a while, so I'll post it in a bit.
Thanks,
-Jason