On Wed, Jan 10, 2018 at 06:17:11AM +0000, Wols Lists wrote:
> On 09/01/18 22:25, Dave Chinner wrote:
> > On Tue, Jan 09, 2018 at 09:36:49AM +0000, Wols Lists wrote:
> >> On 08/01/18 22:01, Dave Chinner wrote:
> >>> Yup, 21 devices in a RAID 10. That's a really nasty config for
> >>> RAID10 which requires an even number of disks to mirror correctly.
> >>> Why does MD even allow this sort of whacky, sub-optimal
> >>> configuration?
> >>
> >> Just to point out - if this is raid-10 (and not raid-1+0, which is
> >> a completely different beast) this is actually a normal linux
> >> config. I'm planning to set up a raid-10 across 3 devices. What
> >> happens is that raid-10 writes X copies across Y devices. If X = Y
> >> then it's a normal mirror config, if X < Y it makes good use of
> >> space (and if X > Y it doesn't make sense :-)
> >>
> >> SDA: 1, 2, 4, 5
> >> SDB: 1, 3, 4, 6
> >> SDC: 2, 3, 5, 6
> >
> > It's nice to know that MD has redefined RAID-10 to be different to
> > the industry standard definition that has been used for 20 years
> > and optimised filesystem layouts for. Rotoring data across odd
> > numbers of disks like this is going to really, really suck on
> > filesystems that are stripe layout aware..
>
> Actually, I thought that the industry standard definition referred to
> Raid-1+0. It's just colloquially referred to as raid-10.

https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10

"However, a nonstandard definition of "RAID 10" was created for the
Linux MD driver"

So it's not just me who thinks what MD is doing is non-standard.

> > For example, XFS has hot-spot prevention algorithms in its internal
> > physical layout for striped devices. It aligns AGs across different
> > stripe units so that metadata and data doesn't all get aligned to
> > the one disk in a RAID0/5/6 stripe. If the stripes are rotoring
> > across disks themselves, then we're going to end up back in the
> > same position we started with - multiple AGs aligned to the same
> > disk.
>
> Are you telling me that xfs is aware of the internal structure of an
> md-raid array?

It's aware of the /alignment/ characteristics of block devices, and
these alignment characteristics are exported by MD. e.g. these are
exported in /sys/block/<dev>/queue:

	minimum_io_size - typically the stripe chunk size
	optimal_io_size - typically the stripe width

We get this stuff from DM and MD devices, hardware raid (via scsi code
pages), thinp devices (i.e. to tell us the allocation granularity so
we can align/size IO to match it) and any other block device that
wants to tell us about optimal IO geometry.

libblkid provides us with this information, and it's not just mkfs.xfs
that uses it. e.g. mkfs.ext4 also uses it for the exact same purpose
as XFS....

> Given that md-raid is an abstraction layer, this seems rather
> dangerous to me - you're breaking the abstraction and this could
> explain the OP's problem. Md-raid changed underneath the filesystem,
> on the assumption that the filesystem wouldn't notice, and the
> filesystem *did*. BANG!

No, we aren't breaking any abstractions. It's always been the case
that the filesystem needs to be correctly aligned to the underlying
storage geometry if performance is desired.

Think about old skool filesystems that were aware of the old C/H/S
layout of drives back in the 80s. Optimising layouts for "cylinder
groups" in the hardware gave major performance improvements and we can
trace ext4's block group concept all the way back to those specific
hardware geometry requirements.
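To make that concrete, here's a rough sketch of how those sysfs hints
map onto mkfs.xfs stripe alignment. It's illustrative only - not the
actual mkfs.xfs/libblkid code path - and "md0" is just an example
device name:

#!/usr/bin/env python3
# Illustrative sketch, not the mkfs.xfs/libblkid implementation: read
# the IO geometry hints a block device exports in sysfs and derive the
# equivalent mkfs.xfs stripe unit/width options from them.
from pathlib import Path

def io_geometry(dev="md0"):               # "md0" is just an example name
    q = Path("/sys/block") / dev / "queue"
    min_io = int((q / "minimum_io_size").read_text())  # chunk size, bytes
    opt_io = int((q / "optimal_io_size").read_text())  # stripe width, bytes
    return min_io, opt_io

def mkfs_alignment(min_io, opt_io):
    # su = stripe unit (chunk size), sw = data disks per stripe.
    if min_io and opt_io and opt_io % min_io == 0:
        return min_io, opt_io // min_io
    return None                            # no usable geometry advertised

if __name__ == "__main__":
    min_io, opt_io = io_geometry("md0")
    align = mkfs_alignment(min_io, opt_io)
    if align:
        su, sw = align
        print(f"mkfs.xfs -d su={su},sw={sw} /dev/md0")
    else:
        print("no stripe geometry advertised; no alignment would be set")

On a 2-data-disk array with a 512k chunk that prints something like
"mkfs.xfs -d su=524288,sw=2 /dev/md0", which is roughly the derivation
mkfs does for you automatically via libblkid.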
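And to show what the "rotoring" discussed above actually looks like on
disk, here's a toy model of an MD raid10 "near" layout with 2 copies.
It's reconstructed from the 3-device table Wol posted, not taken from
the MD driver source, so treat it as a sketch:

#!/usr/bin/env python3
# Toy model of a raid10 "near" layout with 2 copies, reconstructed from
# the 3-device table quoted above (not taken from the MD driver).
# With an odd device count the chunk-to-device mapping rotates, so a
# fixed filesystem stripe alignment keeps landing on different disks.

def near2_layout(ndevs, nchunks):
    """Map each device index to the chunk numbers it holds."""
    devs = {d: [] for d in range(ndevs)}
    for chunk in range(nchunks):
        for copy in range(2):
            pos = 2 * chunk + copy        # linear position across the array
            devs[pos % ndevs].append(chunk + 1)
    return devs

if __name__ == "__main__":
    for dev, chunks in near2_layout(3, 6).items():
        print("SD%s: %s" % (chr(ord("A") + dev), chunks))
    # Prints the same placement as the table above:
    #   SDA: [1, 2, 4, 5]
    #   SDB: [1, 3, 4, 6]
    #   SDC: [2, 3, 5, 6]

Run it with an even device count (e.g. near2_layout(4, 8)) and the
copies settle into fixed device pairs, which is the geometry a
filesystem can actually align to.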
I suspect that the problem here is that relatively few people
understand why alignment to the underlying storage geometry is
necessary and don't realise the lengths the storage stack goes to in
ensuring alignment is optimal. It's mostly hidden and automatic these
days because most users lack the knowledge to be able to set this sort
of stuff up correctly.

> > The result is that many XFS workloads are going to hotspot disks
> > and result in unbalanced load when there are an odd number of disks
> > in a RAID-10 array. Actually, it's probably worse than having no
> > alignment, because it makes hotspot occurrence and behaviour very
> > unpredictable.
> >
> > Worse is the fact that there's absolutely nothing we can do to
> > optimise allocation alignment or IO behaviour at the filesystem
> > level. We'll have to make mkfs.xfs aware of this clusterfuck and
> > turn off stripe alignment when we detect such a layout, but that
> > doesn't help all the existing user installations out there right
> > now.
>
> So you're telling me that mkfs.xfs *IS* aware of the underlying raid
> structure. OOPS! What happens when that structure changes, for
> instance when a raid-5 is converted to raid-6, or another disk is
> added?

RAID-5 to RAID-6 doesn't change the stripe alignment. That's still N
data disks per stripe, so the geometry and alignment are unchanged and
there is no impact on the layout.

But changing the stripe geometry (i.e. number of data disks)
completely fucks IO alignment and that impacts overall storage
performance. None of the existing data in the filesystem is aligned to
the underlying storage anymore so overwrites will cause all sorts of
RMW storms, you'll get IO hotspots because what used to be on separate
disks is now all on the same disk, etc. And the filesystem won't be
able to fix this because unless you increase the number of data disks
by an integer multiple, the alignment cannot be changed due to fixed
locations of metadata in the filesystem.

> If you have to have special code to deal with md-raid and changes in
> said raid, where's the problem with more code for raid-10?

I didn't say we had code to handle "changes in said raid". That's
explicitly what we /don't have/. To handle a geometry/alignment change
in the underlying storage we have to *resilver the entire filesystem*.
And, well, we can't easily do that because that means we'd have to
completely rewrite and re-index the filesystem. It's faster, easier
and more reliable to dump/mkfs/restore the filesystem than it is to
resilver it.

There's many, many reasons why RAID reshaping is considered harmful
and is not recommended by anyone who understands the whole storage
stack intimately.

> > IMO, odd-numbered disks in RAID-10 should be considered harmful and
> > never used....
>
> What about when you have an odd number of mirrors? :-)

Be a smart-ass all you want, but it doesn't change the fact that the
"grow-by-one-disk" clusterfuck occurs when you have an odd number of
mirrors, too.

> Seriously, can't you just make sure that xfs rotates the stripe units
> using a number that is relatively prime to the number of disks?

Who said we don't already rotate through stripe units?

And, well, there are situations where ignoring geometry is good (e.g.
delayed allocation allows us to pack lots of small files together so
they aggregate into full stripe writes and avoid RMW cycles) and there
are situations where stripe width rather than stripe unit alignment is
desirable for a single allocation (e.g.
large sequential direct IO writes so we avoid RMW cycles due to
partial stripe overlaps in IO). These IO alignment optimisations are
all done on-the-fly by filesystems.

Filesystems do far more than you realise with the geometry information
they are provided with, and that's why assuming you can transparently
change the storage geometry without the filesystem (and hence users)
caring about such changes is fundamentally wrong.

> (Just so's you know who I am, I've taken over editorship of the raid
> wiki. This is exactly the stuff that belongs on there, so as soon as
> I understand what's going on I'll write it up, and I'm happy to be
> educated :-) But I do like to really grasp what's going on, so expect
> lots of naive questions ... There's not a lot of information on how
> raid and filesystems interact, and I haven't really got to grips with
> any of that at the moment, and I don't use xfs. I use ext4 on gentoo,
> and the default btrfs on SUSE.)

You've got an awful lot of learning to do, then.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx