On Fri, Jun 23, 2023 at 11:05 AM Jason Baron <jbaron@xxxxxxxxxx> wrote: > > We've found that using raid0 with the 'original' layout and discard > enabled with different disk sizes (such that at least two zones are > created) can result in data corruption. This is due to the fact that > the discard handling in 'raid0_handle_discard()' assumes the 'alternate' > layout. We've seen this corruption using ext4 but other filesystems are > likely susceptible as well. > > More specifically, while multiple zones are necessary to create the > corruption, the corruption may not occur with multiple zones if they > layout in such a way the layout matches what the 'alternate' layout > would have produced. Thus, not all raid0 devices with the 'original' > layout, different size disks and discard enabled will encounter this > corruption. > > The 3.14 kernel inadvertently changed the raid0 disk layout for different > size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the > same raid0 array could corrupt data. This lead to the creation of the > 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout > (to match the post 3.14 layout) in the 5.4 kernel time frame and an option > to tell the kernel which layout to use (since it couldn't be autodetected). > However, when the 'original' layout was added back to 5.4 discard support > for the 'original' layout was not added leading this issue. > > I've been able to reliably reproduce the corruption with the following > test case: > > 1. create raid0 array with different size disks using original layout > 2. mkfs > 3. mount -o discard > 4. create lots of files > 5. remove 1/2 the files > 6. fstrim -a (or just the mount point for the raid0 array) > 7. umount > 8. fsck -fn /dev/md0 (spews all sorts of corruptions) > > Let's fix this by adding proper discard support to the 'original' layout. > The fix 'maps' the 'original' layout disks to the order in which they are > read/written such that we can compare the disks in the same way that the > current 'alternate' layout does. A 'disk_shift' field is added to > 'struct strip_zone'. This could be computed on the fly in > raid0_handle_discard() but by adding this field, we save some computation > in the discard path. > > Note we could also potentially fix this by re-ordering the disks in the > zones that follow the first one, and then always read/writing them using > the 'alternate' layout. However, that is seen as a more substantial change, > and we are attempting the least invasive fix at this time to remedy the > corruption. > > I've verified the change using the reproducer mentioned above. Typically, > the corruption is seen after less than 3 iterations, while the patch has > run 500+ iterations. > > Cc: NeilBrown <neilb@xxxxxxx> > Cc: Song Liu <song@xxxxxxxxxx> > Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") > Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx> Looks good to me! Applied to md-next. Since this will be released with 6.6, we should have a smaller and safer fix before that. Would you mind create a patch that ignores all discards to orig_layout and not the first zone? We will roll that to 6.5 and back port to stable. Then this version will be shipped to 6.6+. Thanks, Song