On Wed, Jun 28, 2023 at 3:43 PM Song Liu <song@xxxxxxxxxx> wrote: > > Hi Jason, > > Thanks for the fix! > > On Wed, Jun 28, 2023 at 7:32 AM Jason Baron <jbaron@xxxxxxxxxx> wrote: > > > > We've found that using raid0 with the 'original' layout and discard > > enabled with different disk sizes (such that at least two zones are > > created) can result in data corruption. This is due to the fact that > > the discard handling in 'raid0_handle_discard()' assumes the 'alternate' > > layout. We've seen this corruption using ext4 but other filesystems are > > likely susceptible as well. > > > > More specifically, while multiple zones are necessary to create the > > corruption, the corruption may not occur with multiple zones if they > > layout in such a way the layout matches what the 'alternate' layout > > would have produced. Thus, not all raid0 devices with the 'original' > > layout, different size disks and discard enabled will encounter this > > corruption. > > > > The 3.14 kernel inadvertently changed the raid0 disk layout for different > > size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the > > same raid0 array could corrupt data. This lead to the creation of the > > 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout > > (to match the post 3.14 layout) in the 5.4 kernel time frame and an option > > to tell the kernel which layout to use (since it couldn't be autodetected). > > However, when the 'original' layout was added back to 5.4 discard support > > for the 'original' layout was not added leading this issue. > > > > I've been able to reliably reproduce the corruption with the following > > test case: > > > > 1. create raid0 array with different size disks using original layout > > 2. mkfs > > 3. mount -o discard > > 4. create lots of files > > 5. remove 1/2 the files > > 6. fstrim -a (or just the mount point for the raid0 array) > > 7. umount > > 8. fsck -fn /dev/md0 (spews all sorts of corruptions) > > > > The fix here is a minimal fix intended for stable trees which doesn't do > > discard if we have the original layout and we are not in first zone or > > if the i/o crosses zones (we either do the entire discard or none of it). > > The proper fix to actually perform discard to all zones for the original > > layout will land in upstream versions. We have implemented the minimal > > fix here for stable branches to reduce risk. > > > > I've verified the change using the reproducer mentioned above. Typically, > > the corruption is seen after less than 3 iterations, while the patch has > > run 500+ iterations. > > > > Cc: NeilBrown <neilb@xxxxxxx> > > Cc: Song Liu <song@xxxxxxxxxx> > > Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") > > Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx> > > --- > > drivers/md/raid0.c | 17 ++++++++++++++++- > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c > > index f8ee9a95e25d..2713a4acb44f 100644 > > --- a/drivers/md/raid0.c > > +++ b/drivers/md/raid0.c > > @@ -444,10 +444,25 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) > > sector_t end_disk_offset; > > unsigned int end_disk_index; > > unsigned int disk; > > + bool bio_zone_overlap = false; > > > > zone = find_zone(conf, &start); > > + if (bio_end_sector(bio) > zone->zone_end) > > + bio_zone_overlap = true; > > + > > + /* The discard code below doesn't properly support the original > > + * layout for the zones above the first one. We are adding > > + * proper support in later kernel versions but have decided > > + * that dropping discard support here is the lower risk option > > + * to avoid data corruption for stable versions. > > + */ > > + if ((conf->layout == RAID0_ORIG_LAYOUT) && > > + ((zone != conf->strip_zone) || (bio_zone_overlap))) { > > + bio_endio(bio); > > + return; > > + } > > For bio_zone_overlap case, I think we can still do the split below and > send discard to the first zone? Actually, on a second (third?) thought, I think we can just ship the other version (20230623180523.1901230-1-jbaron@xxxxxxxxxx) to stable kernel. It is very safe. So we don't need more work on this version. Sorry for the confusion. Thanks again for the fix! Song